Construct validity (9, 31) is an
important new concept which has
immediate implications for both
psychometrician and experimentalist. Most
important is the increased emphasis 
which construct validity places upon 
the role of theory in the validation of 
psychological tests. The aims of the 
present paper are two: (a) to consider 
the directive role of theory in the con- 
struction of psychological tests; and 
(b) to examine certain methodological 
issues which arise from the more ex- 
plicit use of theory in test construc- 
tion. For illustrative purposes we 
have chosen to make a critical analy- 
sis of the Taylor Anxiety Scale (A 
scale) and the research (14, 26, 27, 
28, 30) in which it has been employed 
to establish the independent variable 
of drive (Hull's D). 

The nature of the A scale and the 
results of the studies in which it has 
been used are well enough known so 
that only a brief description is nec- 
essary here. The scale is a self-report 
inventory consisting of 50 manifest- 
anxiety items and 175 buffer items, 
both groups of items taken almost 
entirely from the Minnesota Mul- 
tiphasic Personality Inventory 
(MMPI). The research studies, con- 
cerned with testing the assumption 
that the A scale measures drive level, 
have evaluated the energizing prop- 
erty of D. They have indicated that, 
where the correct response in an
experimental learning situation has a
high probability of occurrence, the 
high scorers on the A scale perform 
better than the low scorers. Where 
the experimental situations are such 
that there are competing responses 
or the incorrect responses are equally 
likely of occurrence at the outset, the 
high scorers perform less adequately 
than the low scorers. Both findings 
are consistent with the Hullian as- 
sumption that all habit tendencies 
elicited in a given situation are mul- 
tiplicatively affected by the level of 
drive at the time. These findings 
have provided the basis for inferring 
that the A scale is, therefore, a meas- 
ure of drive. 
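
The multiplicative assumption can be made concrete with a small numerical sketch. It assumes the simplest reading of the Hullian relation, in which the excitatory strength of each response is the product of its habit strength and the drive level. The habit strengths, drive values, and function names below are hypothetical, chosen only for illustration, and are not taken from the studies cited.

```python
# Illustrative sketch only: habit strengths and drive levels are made-up numbers.


def excitatory_strengths(habits, drive):
    """Multiply every habit strength by the same drive level (E = D * H)."""
    return {response: drive * strength for response, strength in habits.items()}


def margin_of_correct(habits, drive):
    """Strength of the correct response minus that of its strongest competitor."""
    strengths = excitatory_strengths(habits, drive)
    strongest_competitor = max(
        value for response, value in strengths.items() if response != "correct"
    )
    return strengths["correct"] - strongest_competitor


# Situation 1: the correct response is dominant at the outset.
dominant = {"correct": 0.8, "incorrect": 0.2}
# Situation 2: a competing incorrect response is stronger at the outset.
competing = {"correct": 0.3, "incorrect": 0.6}

LOW_DRIVE, HIGH_DRIVE = 1.0, 2.0  # hypothetical low- and high-scoring groups

for label, habits in (("dominant", dominant), ("competing", competing)):
    low = margin_of_correct(habits, LOW_DRIVE)
    high = margin_of_correct(habits, HIGH_DRIVE)
    print(f"{label}: margin at low drive = {low:.2f}, at high drive = {high:.2f}")
```

Under these assumed numbers, raising drive enlarges the advantage of whichever response is already stronger: the correct response gains ground in the first situation and loses ground in the second, which is the pattern the two findings are taken to reflect.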


THE USE OF THEORY IN TEST
CONSTRUCTION

Cronbach and Meehl state that
“Construct validation takes place
when an investigator believes that
his instrument reflects a particular
construct to which are attached
certain meanings. The proposed
interpretation generates specific testable
hypotheses which are a means of
confirming or disconfirming the claim”
(9, p. 290). These authors take as the 
starting point of their discussion the 
presence of an already existing test 
or scale purporting to measure or 
thought to measure a particular vari- 
able. They are concerned with the 
methods of establishing the construct
validity of a test after the test has
been devised. The present authors, 
on the other hand, are concerned with 
the process of devising a scale or test 
so that it will be consistent with the 
procedures of construct validation. 
Our contention is that the test situa- 
tion itself, and the kinds of test be- 
havior it elicits, must be coordinated 
to the theory in exactly the same 
manner as the experiments aimed at 
validating the test. The Iowa
experiments with the A scale were designed
to fit the paradigm required by the
Hullian framework (i.e., they were
designed to measure learning, to
control probability of occurrence of
correct responses, to control other
significant sources of drive variation,
etc.) in order to make inferences to
that framework. The same logic
requires that the A scale itself should
likewise have been designed so that 
performance on it might be a basis for 
inferring drive independently of the 
outcome of subsequent experiments. 

Emphasis on the need for theoreti- 
cal derivation of psychological tests 
may be found in recent work by Peak 
(22) and Butler (5). Their general 
contention is that the theory or the 
properties of the construct should 
determine the nature of the test itself 
as well as the nature of experiments 
which establish the construct validity 
of the test. Peak asserts that “The
design of objective instruments and
procedures requires ... a theory
about the characteristics and
relationships of any variable to be
measured...” (22, p. 296). She offers an
enlightening example of this point:
“If, for example, [the investigator]
sets out to devise a measure of
hostility with a knowledge of the
psychoanalytic theory of defense
mechanisms, the questions asked and the
behavior observed will be very
different from that which would seem
relevant if manifest expressions of
hostility were regarded as the only
appropriate data” (22, p. 247).

Butler (5) has recently called at- 
tention to the preoccupation of psy- 
chometricians with the formal re- 
quirements of testing at the cost of 
ignoring the role of psychological 
theory in developing tests. He finds 
it astonishing that there is “... no 
personality inventory for which the 
content, the form of the items, and 
the psychometric methods applied 
have been dictated by a formal psy- 
chological model” (5, p. 77). The 
remainder of his article is a program- 
matic effort to use Tolman’s theoreti- 
cal model as the source of hypotheses 
about the nature of psychometric 
items most likely to provide useful 
intervening or independent variables.¹

The important point here, which 
relates this discussion to construct 
validity, is that the psychological, 
or theoretical, model has implica- 
tions for the psychometric, as well 
as for the experimental, procedure. 
It is an artifact of tradition that 
theories have been utilized to derive 
experiments but not to derive tests. 
Yet construct validity makes the 
same set of demands on both the 
psychometric and experimental ap- 
proaches. Each approach requires 
that behavior take place under speci- 
fied and controlled conditions. There 
seems to be no fundamental reason 
why theories should make unequivo- 
cal demands on the experiment and 
permit the test to satisfy psycho- 
metric requirements only. 

The difficulties in moving from
theories to empirical conditions, or
from theories to classes of observable
behaviors, are, of course, apparent.
Peak (22) acknowledges that there is
no simple methodological
prescription for meeting the requirements of
theories. Cronbach and Meehl (9)
call attention to the absence in
psychology of a formal calculus which
can provide rigorous implicit
definitions of primitive terms and give
them empirical meaning.
Nevertheless, as they point out (9, p. 294), a
theoretical network, though
admittedly vague and sketchy, does exist
and provides constructs with
whatever meaning they do have.

¹ Although it is not the purpose of this
article to examine examples of tests whose items
have been derived from theoretical models,
the reader may refer to one such test which
will serve as an illustration. The test was
derived by Liverant (20) from Rotter’s social
learning theory (24) in order to measure the
construct of need value.

This network, which guides at- 
tempts at construct validity, should 
play also the prior role, we suggest, 
of guiding test development. Such 
procedure would have important im- 
plications for the adoption of strategy 
in subsequent construct-validation 
attempts where the outcome proves 
to be negative and the investigator 
has to decide where to lay blame—on 
the test or the theory—and decide 
which to revise or discard. 


THE DEVELOPMENT OF THE 
TAYLOR ANXIETY SCALE 


With the foregoing considerations 
in mind, we return to an examination 
of the development of the A scale. 
The general question we are asking 
about the A scale is: In what way are 
the form of the scale, the item selec- 
tion procedure, the item content, and 
the nature of the responses elicited 
by the scale coordinated to or derived 
from the Hullian framework as
indicants of drive? Nowhere, to our
knowledge, is this made explicit or is
a suitable answer to be found; yet 
this is precisely what our point of 
view would demand. 


Form of the Scale


The issue in this section lies in the
coordination between the inventory
self-report form of the A scale and
the Hullian construct of drive. In
Hullian theory, drive level is
coordinated to both antecedents and
consequents. The antecedents are
generally conditions, e.g., food deprivation,
shock, etc., which establish internal
states that the organism seeks to
avoid. The consequents of drive level
are activity or level of energy
expenditure. It is clear that the
inference of drive level from the A scale
is contingent upon consequents; i.e.,
drive level in this case is a
response-inferred construct, since no control or
manipulation of conditions
antecedent to the A-scale responses has been
accomplished. Although there has
been some general criticism of
response-inferred constructs (18, 25),
it is clear that Brown and Farber (4)
do not consider such criticism fully
warranted with respect to inferring
drive. They note (according to
Farber) that while “... more data than
those provided by the topography of
a response are needed to enable one
to identify the extent of its
dependence upon one rather than another of
its many determinants, this does not
mean that there are no criteria of
drive applicable to responses” (12, p.
26). This is an important statement;
yet the obvious fact is that such
criteria are nowhere presented in a
manner which would coordinate
inventory self-reports to drive. Their
statement suggests the future
possibility of reliance on inventories, but a
query still has to be raised as to
whether the Hullian concept of drive
can, in terms of its present definition,
be at all coordinated to self-report
verbal responses on any inventory.


Selection of Items 


If one is concerned only with the
predictive validity of a test, the
matter of item content is relatively
unimportant, for the empirical
item-criterion correlations provide criteria
for the final selection of items.
However, when a test-developer insists
(cf. 28, p. 84; 13, p. 324) that his
purpose includes more than the
prediction of a particular criterion
performance and that the test items are
intended to be indicators of a
construct, then item content becomes
highly important, and item-criterion
correlations alone are insufficient.

No one can say precisely what the
specific steps relating empirical
operations to a construct should be, since
these must vary with the nature of
the construct and the intent of the
investigator. However, it is possible
to assert that the chain of empirical
operations should meet at least one
criterion. This criterion is made
explicit by Cronbach and Meehl as
follows: “... unless the network ...
[of constructs and hypotheses]
exhibits explicit, public steps of
inference, construct validation cannot be
claimed” (9, p. 291). We take this to
mean that all the methodological
links in the development of a test
must be scrutinized for their “explicit,
public,” and therefore objective and
retraceable, character. No test can
be more objective than the most
subjective link in its development.
Therefore, test items which are
intended to indicate a construct should
be selected by rational (rather than
intuitive) means. This means that an
item should be scrutinized for its
logical relationship to a construct and
that the grounds for choice of an item
should be explicit and public. The
difficulty of deriving a series of items
will depend on the scope, precision,
content, etc., of the construct to be
measured; but there should be no
need to resort to a procedure which
relies on implicit and private (i.e.,
undefended or unexplained)
judgments or ratings. A single explicit
(and sound) argument for an item is
better than an implicit rating of an
item by many judges, because the
former is a retraceable (and therefore
self-corrective) step while the latter
is not. (Once the choice of items has
been made, the empirical criterion
for inclusion may well be interitem
correlations since the concept may
specify a unitary function.) In any
case, high interobserver agreement
is no substitute for logical validity.

This point needs emphasis because 
clinical psychologists and psychia- 
trists are often used as judges in the 
development of tests. Since their 
judgments are usually obtained on 
an intuitive basis (the judges are 
rarely asked to deduce the items ac- 
cording to the logical requirements 
of a concept), a hazard is created (in 
reference to construct validity) which 
cannot be overcome by appeal to 
authority (cf. 16). 

The hazard introduced by the in- 
tuitive procedures usually involved in 
judging is exemplified in the lack of
explicit relationship between A-scale 
items and drive properties. As 
pointed out above, Taylor's concept 
of drive was intended to be identical 
with Hull's. Yet the procedure for 
selecting items apparently was not to 
scrutinize them for their logical rela- 
tionship to drive. Rather, the items 
were selected on the basis of clinical 
impression of how well they fit Cam- 
eron’s (7) definition of anxiety. But 
why Cameron's definition of a con- 
cept when it is Hull’s concept of drive 
which is to be given empirical con- 
tent? An examination of Cameron's 
definition leads us to conclude that
there are no obvious reasons for
choosing his definition rather than
any other.

Subsequently the test was shown to
discriminate to some extent between
psychiatric patients and normals.
Taylor reports that, “In an attempt
to determine the relationship between 
the anxiety-scale scores and manifest 
anxiety as defined and observed by 
the clinician, the anxiety scores for 
groups of normal individuals and 
psychiatric patients were compared” 
(29, p. 290). The empirical situation 
at this point is as follows: The scale 
can now be said to be representative 
of certain clinicians’ judgments about 
patients, i.e., the scale is a quick 
device for reaching the same decision 
as certain clinicians about the mani- 
fest anxiety of patients. Unfortu- 
nately, it is by no means clear what re- 
lationship this classification by clini- 
cians has to the Hullian concept of 
drive. 


Content of the Items 


One might well question how the
content of various items of the A
scale can be conceptualized in terms
of the properties of drive. Why
should answering “false” to such
statements as “I have very few
headaches” and “I am very confident of
myself” constitute a referent for
higher drive than answering them
“true”? Why should answering
“true” to an item reporting diarrhea
and one reporting constipation both
indicate higher drive than answering
“false” or answering one of them
“true” and the other “false”?
Answers to our questions about the
content of the items would be
provided whenever a test has been
derived from a theory. Taylor states
that two assumptions guided the use
of the A scale: “First, that variation
in drive level of the individual is
related to the level of internal anxiety
or emotionality, and second, that the
intensity of this anxiety could be
ascertained by a paper-and-pencil test
consisting of items describing what
have been called overt or manifest
symptoms of this state” (29, p. 285).
The first assumption is pertinent 
here. Only if one is willing to equate 
emotionality or internal anxiety with 
level of energy expenditure could one 
accept the use of some of the items. 
This equation itself requires logical 
justification. The actual procedure, 
however, was to have judges select 
items relating to manifest anxiety, 
and a logical gap thus exists between 
manifest anxiety and energy expendi- 
ture. 


Nature of the Responses 


The second assumption Taylor 
mentions is that anxiety or emotion- 
ality may be assessed by a paper-and- 
pencil test in which the subject ac- 
knowledges symptoms of this state. 
Under our discussion of the form of 
the scale we raised questions about 
the theoretical soundness of this as- 
sumption. Here we wish to turn our 
attention to psychometric aspects of 
this assumption. In experimental 
situations we generally observe or 
measure the actual behavior or re- 
sponses on which we base our infer- 
ences. The same is not true in psy- 
chometric measurement. We observe 
or elicit responses (verbal) about
other responses (nonverbal). A verid- 
ical relationship between verbal and 
nonverbal responses is a fundamental 
requirement of the chain of inference 
involved in the use of the A scale to 
measure drive. Yet the A scale is 
vulnerable to the oft-cited and well- 
substantiated criticisms of self-report 
inventories where the social desira- 
bility or meaning of the item content 
is clear to the respondent. Procedures 
designed to maximize veridicality, 
such as forced-choice items, or to de- 
tect lack of honesty, such as the L 
and K scales on the MMPI, are not 
utilized in the test. (Since the items 
from the latter scales are included
among the buffer items, their use is
apparently left to the discretion of
the test user.) As a matter of fact, a
recent study (10) suggests that the
Taylor scale may be more susceptible
to deception than are other objective
measures of anxiety.
Further, the A scale elicits only
two responses: “true” or “false.”
Since the purpose of the scale is to
arrive at a measure of intensity of
anxiety, a scale form providing for 
responses of varying intensity for 
each item would seem preferable. A 
rating scale for each item, or at least 
several response categories ranging 
in degree of agreement or disagree- 
ment with the item, might be more 
appropriate. 

Thus far we have examined the de- 
velopment of the A scale in terms of 
its relationship to Hullian theory. 
The lack of any logical relationship 
raises the issue of interpretation of
research findings. One might ask what
the consequences would have been,
for either theory or test, had the
studies with the A scale yielded
negative findings. Cronbach and Meehl
(9, p. 295) note that the investigator
whose prediction and data are
discordant must make strategic
decisions: he may decide his test is not an
adequate measure of the construct,
or he may call into question the
network defining the construct, if he has
confidence in the test. The latter
phrase is the core of the matter. With
respect to the A scale, no explicit
logical basis for confidence in the test
as a measure of drive existed prior to
the experiments, and the strategy of
the investigator, in the face of
negative results, would undoubtedly have
been to challenge the test rather than
Hull's assumptions. The issue,
however, would be more critical where
both the test and the nomological
network are at their inception and
neither has been extensively
employed, because the reason for
negative findings is equally likely to be in
the theory or the test. In such cases 
(typical for personality formulations) 
a theoretically derived test would 
yield the advantage of directing fur- 
ther theoretical analysis and develop- 
ment. 


THE CONSTRUCT VALIDITY OF 
THE A SCALE 


We shall next consider the status 
of the A-scale studies within the 
framework of construct validity as 
discussed by Cronbach and Meehl. 
This section will be limited to two 
main points: (a) the degree to which
diverse aspects or consequences of
the nomological net surrounding the
construct of drive have been
investigated; and (b) the degree to which
alternative inferences from the
A-scale studies have been disconfirmed.


Validation of Diverse Properties of a 
Construct 


Cronbach and Meehl stress the
following point: “Numerous
successful predictions dealing with
phenotypically diverse ‘criteria’ give greater
weight to the claim of construct
validity than do fewer predictions, or
predictions involving very similar
behaviors. In arriving at diverse
predictions, the hypothesis of test validity
is connected each time to a
subnetwork largely independent of the
portion previously used. Success of these
derivations testifies to the inductive
power of the test-validity statement,
and renders it unlikely that an
equally effective alternative can be
offered” (9, p. 295). And, “The test
developer must investigate
far-separated, independent sections of the
network” (9, p. 299). Further, in a
related discussion of the
establishment of connections between inferred
entities and observables, Beck
emphasizes the methodological rule that
“Each component of the inferred
entity must be symptomized by some
datum, actual or available...” (2,
p. 375).

In a recent paper Farber
acknowledges two components or properties
of drive: energizing and reinforcing.
He indicates that a given variable has
the characteristics of a drive “if a) its
elimination or reduction in magnitude
is reinforcing, and/or b) it has a
general dynamogenic effect upon the
response tendencies elicited in a given
situation” (12, pp. 38-39). The bulk
of animal studies have dealt with the
former property. None of the A-scale
studies has investigated this property;
all have, instead, dealt with the
latter, energizing, property of drive
(the multiplicative relation of sHR
and D). Farber notes that although
there are difficulties in demonstrating
the reinforcing properties of manifest
anxiety, “It is quite possible that this
sort of demonstration can be
accomplished, but to the best of my
knowledge no one has yet done so” (12, p.
27). The requirements of construct
validation would certainly favor the
exploration of this “far-separated,
independent section of the network.”

Other sections of the “phenotypic
space” require investigation also,
particularly the effect of
manipulation of antecedent conditions on the
A-scale responses themselves.
Atkinson (1) asks whether scores on the A
scale would increase if anxiety were
experimentally increased. The
possibility of employing conditions, e.g.,
shock, suggested by other research
concerned with establishing or
increasing drive, becomes apparent. To
summarize, the point is that
construct validity requires investigation
of diverse properties of the construct.
One reason for making this
requirement is to lower the likelihood of
finding acceptable alternative inferences
which can encompass such diversity.
This leads us to the next major issue.


Disconfirmation of Alternative Inferences

Confirmation of an inference is also
established to the extent that other
inferences are not equally applicable.
Beck states that “Confirmation can
come only from the disconfirmation
of all alternative hypotheses through
the evidential denial of at least one
consequent of each alternative...”
(2, p. 377). In the light of this
criterion, the inference of drive from the
A-scale studies is not secure. Various
investigators have made alternative
inferences about what the Taylor
scale measures. Three of these
alternatives will be mentioned here.

(a) Most prominent is the
controversy raised by Hilgard (17), and by
Child (8). They consider an equally
plausible hypothesis to be that the A
scale measures only different sHR's
rather than different drive levels.
Hilgard has concluded that anxiety
responses or anxiety-related responses,
e.g., stronger defensive or avoidance
habits, can account equally well for
the data. Certainly, on the face of it,
the scale measures nothing other than
differential response systems
(assuming veridicality). Farber has been
explicit in acknowledging an
associative component in what the A scale
measures, but he insists that it is the
drive component which is inferable
from the research. Overlooking the
possibility, as suggested by Postman
(23), that there are no operational
means for separating these two
components, our immediate purpose is to
indicate that alternatives to the drive
inference have, at the very least, as
yet not been disconfirmed.


(b) Recent studies (6, 15, 19) have
also suggested another alternative
hypothesis, namely that the scale 
measures intellectual (habit?) dif- 
ferences rather than drive. While the 
implications of intelligence as an ex- 
planation for the A-scale findings are 
not yet clear, these empirical findings 
should be considered. Certainly the 
obtained correlations between A-scale 
scores and intelligence (if they are not 
fortuitous) are not referable to any 
property of drive as thus far defined. 

(c) Finally, the near-perfect
correlation between the A scale and the
MMPI psychasthenia scale (3, 11),
and correlations with other neurotic
inventories (10), raise further
questions as to whether any neurotic
inventory would yield similar
experimental findings, and, if so, in what
way neuroticism in general is
coordinated to drive level.

For the test to meet fully the re- 
quirements of construct validity, 
these alternative inferences must be 
disconfirmed. 


THE CONDITIONAL DEFINITION
OF ANXIETY AS DRIVE 


To illustrate a further
methodological point concerning construct
validity, let us assume our argument
concerning the A scale to be valid: that
when it was used in connection with
the test of the Hullian hypothesis
that drive energizes sHR's, the A
scale had neither logical nor empirical
foundation as a measure of drive.
This raises the question of whether
the studies achieved a definition of
drive or tested the Hullian
hypothesis.


The first step in the experiment
was to identify high and low scorers
on the A scale. (Note that no
theoretically relevant meaning can be
given to any score at this point
because no logical or empirical tie can
be made to the Hullian concept of
drive.) The next step was to perform
the experiment in the eyelid
conditioning arrangement. Results
conforming to expectations were
obtained. The researchers had then
obtained a conditional definition of the
concept of drive. The definition is of
this form: if a high scorer is placed in
a conditioning arrangement, then a
high scorer has a high drive if, and
only if, the high scorer conditions
more rapidly than a low scorer
conditions. Farber puts this as follows:
“... the question of the validity of
the Taylor scale as a useful definition
of general drive level is answered by
the accuracy of the prediction of
relations between this scale and specified
behavior variables, under conditions
such that variation in the behavioral
variables can be reasonably
attributed to differential drive levels” (13,
p. 325). Thus, it appears that the
researchers were attempting to achieve
a conditional definition of drive.

The investigators assert, however,
that the experimental situation
provided both a definition of drive and a
test of the Hullian hypothesis under
consideration at one and the same
time. This procedure introduces an
ambiguity. If the results are
negative, does it mean that the definition
of the concept is faulty, or the
hypothesis relating the variables? A
difficulty remains if the results are
positive, since the hypothesis is
proved only by asserting the
definition. That is, if we ask how the
experimenter knows that he actually
varied drive, he can only reply that
the results are meaningful if the A
scale measures drive. However, as
we have pointed out, the construction
of the A scale has not been
coordinated to drive, the scale has not been
employed in testing the diverse
properties of drive, nor have “reasonable”
alternative inferences about the
A-scale scores been disconfirmed.
Therefore the experimental results alone
are not sufficient support for the
assumption that the A scale measures
drive.
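
The conditional definition discussed above can be written schematically. The notation is an assumption introduced here for illustration and does not appear in the studies cited: $C(x)$ stands for "high scorer $x$ is placed in the conditioning arrangement," $D(x)$ for "$x$ has high drive," and $R(x)$ for "$x$ conditions more rapidly than a low scorer."

```latex
\[
  C(x) \rightarrow \bigl( D(x) \leftrightarrow R(x) \bigr)
\]
```

Read this way, observing $R(x)$ under condition $C(x)$ is what licenses the ascription of $D(x)$, so the same observation cannot also serve as an independent test of a hypothesis that links $D$ to $R$.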

The problem of the definition of
drive is directly analogous to the
problem of the definition of
reinforcement. Meehl (21) points out that the
reason why the Law of Effect is not
circular is that conditional definitions
of reinforcers are made independently
of the test of the Law of Effect. For
example, Meehl presents the
following “special law”: “On schedule M,
the termination of response sequence
R, in setting S, by stimulus S¹ is
followed by an increment in the strength
of S.R.” (21, p. 60). In the studies
involving the A scale, however, M
(the A scale) is not defined by the
investigators in terms of an
independently observed “schedule,” but only
in terms of the response sequence in
the experiment.


When a construct implies a
relationship between variables, these
variables must be designated
independently of any test of that
relationship.


SUMMARY 


Construct validity emphasizes the 
directive role of theory in test valida- 
tion; the intent of this paper has been 
to emphasize the directive role of 
theory in the construction of psycho- 
logical tests. 

Our position is that the
psychometric, as well as the experimental,
procedure must be coordinated to the 
hypothetical properties of the con- 
struct to be measured. In this way 
the test situation is made parallel to 
the experimental situation—the con- 
ditions of both being clearly derived 
from theory, and the behavior elicited 
in both being clearly relevant to the 
theory. 

The above points, as well as cer- 
tain methodological issues arising 
from the explicit use of theory in test 
construction (the investigation of
diverse properties of a construct,
disconfirmation of alternate inferences,
conditional definitions) were
illustrated through a critical examination
of the Taylor Anxiety Scale. Our 
conclusion was that the A scale has 
only a tenuous theoretical and em- 
pirical coordination to the Hullian 
construct of drive. The experiments 
which have relied on the A scale may 
be considered to have attempted thus 
far a conditional definition of drive 
rather than to have demonstrated 
the hypothesis that drive energizes 
habits. 

Our intent has not been to single 
out a particular test for criticism. We 
recognize that much work is being 
done on further construct validation 
of the A scale. Such work, we hope, 
will answer some of the questions we 
have raised, questions which we feel 
are of importance for psychometrics 
as a whole. 
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Although the need for sound re- 
search on the outcome of mental dis- 
ease has long been recognized, and 
although probably no other medical 
specialty has accumulated such an 
abundance of statistics as psychia- 
try, yet the dearth of adequately de- 
signed investigations especially in the 
field of the evaluation of therapy is 
conspicuous. The literature describ- 
ing and appraising the results of in- 
sulin, metrazol, and_ electroshock 
treatment is voluminous. Reports on 
lobotomy results have also mounted 
steadily. In spite of these studies the 
satisfactory evaluation of therapy 
still remains one of the most trouble- 
some problems of psychiatric research. 
In reviewing the vast literature of 
therapeutic results one finds conflict- 
ing reports ranging from severe skep- 
ticism of the various therapies to in- 
ordinate enthusiasm for them. Con- 
sequently every clinician can cite a 
study to support his particular view- 
point on any given therapy. This 
state of affairs suggests the need for a 
critical study of the methodology of 
these evaluation studies with a view 
towards the improvement of research 
design in this area in the future. 


THE PROBLEM 


It would be a Herculean task to attempt 
to record and to review all the studies 
that have been done on the evaluation of 
the various somatotherapies, not only 
because of their excessive number but 
also because their results, so diversely 
presented, defy organized classification 
according to any one uniform plan. It is, 
therefore, our intention to present a 
fairly representative group of researches 
with their results and the techniques by 
which these have been derived. The 
purpose of this study is to examine the 
available and analyzable data on the 
outcome of the shock therapies and 
psychosurgery in order to get an estimate 
of the effectiveness of these therapies 
in the treatment of schizophrenia, as it 
has been reported in the literature. As a 
result, four types of data have been 
analyzed: (a) outcome of nonspecific 
treatment of all psychoses and 
schizophrenia during the preshock 
(pre-1930) period; (b) outcome of 
nonspecific treatment of schizophrenia 
during the preshock period, as reported 
after shock therapy became available; 
(c) outcome of insulin, electroshock, 
metrazol, and psychosurgical therapies; 
and (d) results of comparative studies of 
treated and control groups. 

1 The present publication has resulted from 
a study undertaken by this author during her 
tenure of a fellowship granted by The Fund 
for the Advancement of Education. The 
facilities of Project M586 of the National 
Institute of Mental Health aided in the 
preparation of the material on which this 
study is based. 


Outcome of Nonspecific Treatment of 
All Psychoses and Schizophrenia 
During the Preshock (Pre-1930) Period 


In the beginning of the 20th century, 
both in America and abroad, 
interest was directed toward evaluating 
the outcome of mental disease. 
Practical considerations, such as the 
cost of care of the mentally ill to the 
community, as well as scientific curi- 
osity prompted hospital adminis- 
trators and doctors to investigate the 
results of hospitalizing and caring for 
mental patients. Most studies con- 
sisted in the follow-up of total hospi- 
tal populations over a period of sev- 
eral years after admission to deter- 
mine the ultimate disposition of these 
cases. Bond, Fuller, and Pollock were 
pioneers in this research. All three 
studied the outcome of mental dis- 
ease and all three were unanimous in 
finding a low rate of improvement in 
dementia praecox patients.* We shall 
review the work of Bond and Fuller in 
considerable detail. 

In several studies Bond (6, 7, 8, 9) 
reported his findings on the then cur- 
rent results for use as a base line or 
standard by which new therapeutic 
measures could be judged as they 
were developed. Recognizing the 
scarcity of follow-up results in 
psychiatric work as compared with 
surgery, he compiled the data which he 
had collected mainly at the Pennsylvania 
Hospital, a private institution. Bond 
observed that usually the patients were 
committed to the Hospital after many 
opportunities for early treatment had 
been lost because of the families' 
tendencies to procrastinate. The summary 
of Bond's early follow-up studies on 
heterogeneous groups of patients in the 
preshock period is presented in Table 1. 

* It would be interesting to determine 
whether the uniformly low rate of 
improvement in these early studies was a 
consequence of the narrower definition of 
dementia praecox which Kraepelin 
introduced, rather than the wider 
definition introduced later by Bleuler 
under the term “schizophrenias.” 

In general Bond considered his 
findings encouraging and observed 
that, although the cases came late for 
treatment, if the psychiatrist could 
still produce recovery in approxi- 
mately 25%, with at least about 15% 
ameliorated, then he might rightfully 
feel gratified. 

From Table 1 it can be seen that 
the recovery and improvement rate 
combined is between 40% and 50%. 
Bond maintained that these good re- 
sults with mental patients should be 
emphasized to counteract the then 
(1920's) popular impression that men- 
tal patients always become worse and 
that even if they seem to recover 
they soon break down again. A point 
that one cannot fail to note here in 
reviewing Bond's results is the 
heterogeneity of the patients. All ages, 
all types of diseases, and individuals 
with illnesses of varying durations are 
represented here, but these are followed 
up for a long period. While Bond was 
interested primarily in the general 
outcome for all mental diseases, he 
mentions specifically that in his groups 
the most unchanged by treatment were the 
dementia praecox cases. The above 
tabulated studies are important for they 
were later used in several studies as 
base lines in evaluating some of the 
results of the shock therapies in the 
1930's. 


TABLE 1 

BOND'S FINDINGS ON THE OUTCOME OF MENTAL DISEASE IN PATIENTS 
HOSPITALIZED DURING THE PRESHOCK PERIOD* 

Study      Year   Patient Type   N       Duration of Follow-up (P.Ad.)   Results (%) 
Bond (6)   1921   V1             111     5 years                         3x0 [illegible] 
Bond (7)   1921   V2             251     5 years                         25 
Bond (8)   1923   V3             377     5 years                         30 
Bond (9)   1925   V4             1024    5 years                         25.5 

Note. Only one results column is legible in the source; its subheading is not recoverable. 
V1 includes DP, MD, SA, GP, U, and C patients. 
V2 includes DP, MD, IM, SA, D, G, OBD, E, A, P.som., Pa, PN, Mo-h, P.inf., PsPath. patients. 
V3 includes SA, GP, MD, DP, OBD, and P.peil. patients. 
V4 includes MD, DP, GP, SA, IM, PN, Pa, PsPath., P.som. patients. 

* For glossary of abbreviations used in this and in succeeding tables see Key to Tables in Appendix. 

In addition to Bond, Fuller also 
studied the outcome of mental illness. 
Among his many surveys during the 
preshock period, he reported in 1930 
on the expectation of hospital life 
and outcome for mental patients for 
first admissions to mental hospitals 
(24). This extensive survey furnished 
data on a wide variety of psychoses 
as well as a report on all psychoses. 
He included 1,200 patients in each 
group of psychoses such as Dementia 
Praecox or Manic Depressive, and 
2,400 in the “All Psychoses” group. 
Fuller presented his statistics in a 
particularly effective way, giving the 
results in terms of patients dis- 
charged, dead, and still hospitalized 
at various time intervals up to 15 
years. Fuller's findings regarding the 
outcome of first admissions to a men- 
tal hospital are presented in Table 2. 

In Fig. 1 the percentage discharged 
and not later readmitted for All Psy- 
choses and for Schizophrenia is pre- 
sented. It will be noted that from 3 
months to 15 years there is a rather 
steady rise in the percentage dis- 
charged and not later readmitted for 
both groups, the rate being slightly 
higher for All Psychoses than for 
Schizophrenia. At the end of 5 years, 
29.9% in Schizophrenia and 36.9% 
in the All Psychoses categories re- 
spectively are discharged and not 
later readmitted. At the end of 10 
years, 32.2% of the Schizophrenic 
group and 39.3% of All Psychoses 
are discharged and not later 
readmitted, while after 15 years the 
percentages are 35.3 and 40.9 
respectively for Schizophrenia and All 
Psychoses. 

In another survey (25) based on 
11,050 patients admitted over a two-year 
period to the civil State hospitals 
of New York, Fuller found that out 
of every 1,000 patients representing 
first admissions of all diagnostic 
categories 87.1% were hospitalized only 
once, while 12.9% had more than one 
hospitalization. In 1931 in studying 
the duration of hospital life for mental 
patients, Fuller (26) reported the 
outcome for three groups of mental 
patients: one group admitted to 
New York State civil hospitals between 
1909 and 1911 and observed 
for 16 years; another admitted from 
1914 to 1916 and followed for 11 
years; and the last admitted from 
1919 to 1921 and followed for 6 years. 
In Table 3 Fuller's findings for these 
groups are presented. He gave the 
results for the individual psychoses 
and then presented the outcome for 
all psychoses as a group. Here as in 
the previous table we present his 
findings for manic depressive, dementia 
praecox, and the total of all psychoses. 


TABLE 2 

FULLER'S FINDINGS ON EXPECTATION OF HOSPITAL LIFE AND OUTCOME OF 
MENTAL PATIENTS ON FIRST ADMISSION 

Column groups: MD (percentages based on N = 1,200); DP (percentages based on N = 1,200); 
AP (percentages based on N = 2,400). 
Columns within each group: Not Later Readmitted; Later Readmitted; In [Hospital]; Dead. 
Rows (follow-up period): 3 mos., 6 mos., 9 mos., 1 yr., 2 yrs., 3 yrs., 4 yrs., 5 yrs., 10 yrs., 15 yrs. 
[The cell values of this table are illegible in the source.] 


Just prior to the introduction of 
shock therapy in the United States in 
1935, Fuller (27) published another 
report on the outcome of mental dis- 
ease in 947 patients discharged from 
the civil State hospitals of New York 
during the decade following their 
discharge. As a result of his 
investigation he was able to estimate 
that during a ten-year period, out of 
each 100 patients discharged from the 
civil State hospitals of New York, 55 
would be living in the community, 21 
would be resident again in a mental 
hospital, 23 would have died either in 
the community or in a mental hospital, 
and 1 would be located in some type of 
institution other than a mental 
hospital. A summary of his findings is 
presented in Table 4. 

Once again, as in the case of the 
Bond studies, the groups used by 
Fuller were heterogeneous. Bond's 
recovered and improved category in 
Table 1 is quite similar to Fuller's 
percentage in the community after 
ten years in Table 4. The unimproved 
and death rates also tend to be similar. 


[Figure 1: line graph with one curve for All Psychoses and one for 
Schizophrenia; horizontal axis: duration of follow-up in years.] 

FIG. 1. FULLER'S FINDINGS ON OUTCOME OF MENTAL PATIENTS 
ON FIRST ADMISSION (1930) 


TABLE 3 

FULLER'S FINDINGS (1931) ON THE DURATION OF HOSPITAL 
LIFE FOR MENTAL PATIENTS 

Duration of Follow-up (P.F.Ad.)   Patient Type   N 
16 years                          MD             1,579 
                                  DP             2,481 
                                  AP             11,050 
11 years                          MD             1,868 
                                  DP             3,549 
                                  AP             12,550 
6 years                           MD             1,873 
                                  DP             4,119 
                                  AP             13,473 

Note. Only the AP label is legible in the source; the MD and DP labels follow the order 
of presentation stated in the text. The legible headings of the Results (%) columns are 
U, D (In Hosp.), and In Hosp. at End of Period; the values in these columns are illegible. 



Outcome of Nonspecific Treatment 
of Schizophrenia During the Pre- 
shock Period, as Reported After 
Shock Therapy Became 
Available 

In 1936 with the advent of the 
shock therapies to the United States, 
the need for comparative norms based 
on nonshock patients in the evaluation 
of the results of shock therapy 
became more acute. As psychiatrists 
working with the new techniques 
searched about, they found only a 
limited number of studies with which 
to compare the new results (except for 
those of Bond, Fuller, and Pollock), 
especially for homogeneous groups, 
as, for example, schizophrenics. 
Therefore, in the late 1930's several 
studies evaluating results during the 
preshock period were instituted. 
Table 5 presents the results of a 
group of such studies on the outcome 
of mental disease in patients, mainly 
schizophrenics, hospitalized during 
the preshock period. In appraising 
these results it must be realized that 
during that era patients received 
mainly routine hospital care, or what 
is frequently referred to now as 
nonspecific treatment. Thus, the 
recoveries that occurred under such 
conditions are called, according to 
present standards, “spontaneous 
remissions.” Some psychiatrists, 
however, like Malamud (52) and Solomon 
maintain that there is no such thing as 
“spontaneous remission.” They contend 
that every patient gets something out 
of his hospital stay and that something 
is what helps in recovery. 
Therefore, they claim that everything 
that is done for the patient is in one 
sense or another treatment, whether 
or not it is meant to be treatment by 
the doctor who administers it. 


TABLE 4 

FULLER'S FINDINGS (1935) ON OUTCOME OF 947 PATIENTS DISCHARGED FROM 
THE CIVIL STATE HOSPITALS OF NEW YORK TEN YEARS AFTER DISCHARGE 

                  After Ten Years (P.Dis.) (%)                 During Period (%) 
Patient           In          In         Died            In Community    Readmitted to a 
Type      N       Community   Hospital   During Period   Continuously    Mental Hospital 
MD        327     56.6        19.9       22.6            37.3            47 
DP        242     43.8        43.3       12.8            36.4            55 
AOP       378     59.5        8.7        31.0            48.1            31 
AP        947     54.5        21         23.4            41.6            43 

Note. AP includes all psychoses. 
AOP includes CA, Gpl, CS, OBND, A, D, P.som., Pa, PN, N, PsPath. 
A further fragment of the note reads: MD, IM, DP, F Mdef, Ud. 
Some values appear to have lost their decimal digits in the source and are given as printed. 


In opposition to this viewpoint, 
there are those investigators like 
Stunkard (76) who maintain that in 
order to evaluate therapeutic effects 
of a specific therapy it is necessary to 
contrast it with the expected 
spontaneous improvement. To prove the 
effectiveness of a particular therapy 
one should be able to demonstrate, 
other factors remaining constant, 
that the patient makes more progress 
with the therapy than without it. 
The essential problem in this 
“spontaneous remission” controversy 
revolves around the significance of the 
term “treatment.” These studies of 
spontaneous remission seem to be 
measuring the same factors as the 
studies of the so-called nonspecific 
treatments. It should be mentioned 
here that ordinarily nonspecific 
treatment includes hydrotherapy, 
recreational therapy, occupational 
therapy, physical therapy, and even 
brief interviews with physicians. 

In Table 5 we have included the 
“spontaneous remission” and the 
nonspecific treatment studies. A 
review of these investigations reveals 
very little specification on the part of 
the authors of the type of patients, 
duration of illness, duration of 
follow-up and of other relevant factors 
necessary to compare different studies. 
The earliest study listed in the table, 
that of Bond and Braceland, reports 
the outcome based on a heterogene- 
ous group. The later studies were 
based exclusively on patients suffer- 
ing from schizophrenia, a disease so 
well represented in all mental hos- 
pitals that it offered the greatest chal- 
lenge to the new therapies. Compar- 
ing the over-all results with the re- 
sults for the schizophrenic group in 
the Bond and Braceland study, one 
finds a much higher recovery rate for 
the heterogeneous group. Improve- 
ment in the schizophrenic group is 
fairly low in all studies. Generally 
speaking, most of these studies indi- 
cate about a 30-40% improvement 
rate maintained upon follow-up by 
these so-called spontaneous remis- 
sions or by those who have had non- 
specific treatment. Similar results 
have also been reported from abroad 
by Neumann and Finkenbrink (60) 
who found 32.9% social remissions 
after a twenty-year follow-up. The 
chief difficulty in appraising these re- 
sults is that most workers have not 
indicated the duration of the psycho- 
sis at the time that the patients came 
for treatment. From the few studies 
in which the duration was indicated, 
it can be seen that there is consider- 
able variability in this factor in the 
different studies, as in the case of 
Gelperin’s (28) group which included 
both acute and chronic cases. Such 
variability makes comparisons very 
difficult. However, as we shall see 
presently, the shock period studies of 
the somatotherapies likewise included 
individuals who had been ill for vary- 
ing lengths of time. The statistics 
that have been presented in the fore- 
going tables offer only an over-all 
view. 

In summary then, it may be said 
in respect to the nonspecific treatment 
of psychotics that studies with 
a brief follow-up of one year or less, 
such as those of Fuller, using 
heterogeneous groups indicate discharge 
rates and recovery rates of about 27%, 
increasing with the increase in 
duration of follow-up. The rates for DP 
cases are lower, less than 20%. On 
fifteen-year follow-up Fuller found 
a discharge rate of 35.3% for DP cases, 
while the over-all discharge rate for 
all psychoses (for those not later 
readmitted) was 40.9%. The early 
studies of Bond with heterogeneous 
groups are similar to Fuller's. 
Including the young and old, the 
functional and the organic cases, Bond 
obtained a recovery rate of about 25% 
for five-year follow-up and a total 
improvement rate of about 40% 
(including recovery and improvement). 


TABLE 5 

OUTCOME OF SCHIZOPHRENIA IN PATIENTS RECEIVING NONSPECIFIC TREATMENT, 
AS REPORTED DURING THE SHOCK PERIOD 

Column headings: Study; Patient Type; N; Duration of Follow-up; Results (%): R, MI, I; U. 

Studies, in the order listed (with N where legible): 
Rennie (66), 222 
Malamud & Render (52), 177 
Rupp & Fletcher (70), 608 
Hunt, Feldman, & Fiero (37) 
Cheney & Drewry (13) 
Whitehead (80) 
Bond & Braceland (11) 
Gelperin (28) 
Guttman, Mayer-Gross, & Slater (30) 
Romano & Ebaugh (68) 
Whitehead (80), 90 
Cheney & Drewry (13), 500 
Rennie (66), Sc, 500 

Follow-up and results rows, in the order listed (the alignment of these rows with the 
studies above is not recoverable from the source; values ending in a period have lost 
their decimal digits): 

Duration of Follow-up   R, MI, I        U 
Life                    11.00           89.00 
20 yrs.                 38.29           [illegible].71 
9 yrs.                  52.23           47.76 
5-9½ yrs.               32.00           58.00 
4½-10 yrs.              21.             63.50 
3½-10½ yrs.             35.             [illegible] 
2-12 yrs.               41              59. 
5½ yrs.                 51.             43 
5 yrs.                  53.             25. 
5 yrs.                  31.             56 
5 yrs.                  40.             54 
2-4 yrs.                42 
Up to 4 yrs.            23.[illegible] 
6-18 mos.               36 
Im                      37 
Im                      41.08           57.34 

* Includes DP, MD, GP, IM, P.som., SP, A, PN, Pa, U, PsPath., En. 
† Of the 500, 486 were discharged; 5 died in the hospital and 9 remained in the hospital. 
These 486 were then subsequently followed up. 
‡ Of this group, 10% died; 43% continued in the hospital, and 47% were living outside. 
§ Of the 641, 10.8% died (69), most of them unimproved. 
— Indicates either failure to report or reporting in a manner that could not be tabulated here. 


In Table 5, one heterogeneous 
study, that of Bond and Braceland, 
showed a recovery rate of 35% after 
five years and a total improvement 
rate (recovered and improved) of 
53%. Most of the other studies, all of 
which have been with homogeneous 
groups of schizophrenics (see Table 
5), have indicated that about a 40% 
improvement rate may be expected 
over a five-year period. The tremen- 
dous variation in all of these studies in 
respect to follow-up, duration of ill- 
ness, and the like, makes strict com- 
parisons impossible. 


Outcome of Insulin, Electroshock, 
Metrazol and Psychosurgical 
Therapies (Without Controls) 

The introduction of the shock 
therapies and then, later, of 
psychosurgery was heralded by interested 
and hopeful psychiatrists as a great 
advance. The initial enthusiasm for 
insulin, metrazol, the electroconvulsive 
therapies, and psychosurgery as well 
has, however, been giving way 
gradually to more caution in most circles 
since the mid-1940's, at which time 
the five-year follow-up results of the 
shock therapies and psychosurgery 
began to appear. 

In general, the studies on the out- 
come of these specific somatothera- 
pies can be considered under two cat- 
egories, depending on whether or not 
control groups have been used. The 
first group which we are discussing 
here simply reports the outcome of 
the particular treatment used. Such 
studies as a rule simply indicate the 
immediate and/or follow-up status of 
patients treated with the given ther- 
apy. No attempt is made to use a 
control group. The general implica- 
tion seems to be that the patients 
would have been worse if the particu- 
lar treatment had not been employed. 
A group of such studies is presented 
in Table 6. 


TABLE 6 

STUDIES ON THE OUTCOME OF SPECIFIC SOMATOTHERAPIES (WITHOUT CONTROLS) 

Column headings: Study; Patient Type; Duration of Illness; Duration of Follow-up; 
Type of Therapy; Results (%): R, MI, I; U. 

Studies, in the order listed: 
Palmer & Braceland (64) 
Bateman & Michael (3) 
Halpern (31) 
Bond & Rivers (12) 
Epstein (17) 
Fitzgerald (20) 
Impastato & Almansi (40) 
Malzberg (54) 
Hofstatter et al. (34) 
Kindwall & Cleveland (44) 
Kwalwasser & Robinson (45) 
Smith et al. (74) 
Oltman et al. (63) 
Paster & Holtzman (65) 
Kane et al. (42) 
Bennett (4) 
Wilson (81) 
Malamud et al. (51) 
Bond & Rivers (12) 
Bennett & Wilbur (5) 
Freeman & Watts (22) 
Holt (35) 
MacKinnon (50) 
Morrow & King (59) 
Moore et al. (58) 
Martin (55) 
Stengel (75) 

[The cell values of this table (patient types, durations, therapy types, and result 
percentages) cannot be aligned with the studies from the source.] 

— Indicates either failure to report or reporting in a manner that could not be tabulated here. 

[Figure 2: line graph; horizontal axis: duration of follow-up in years after 
admission (nonspecific and controls).] 

FIG. 2. OUTCOME OF MENTAL PATIENTS GIVEN NONSPECIFIC TREATMENT (PRESHOCK) 
AND OF THOSE USED AS CONTROLS (SCHIZOPHRENICS) 


Results of Comparative Studies with 
Treated and Control Groups 


While the studies just described 
were without controls, others have 
made an attempt to compare un- 
treated groups with those treated 
with a specific somatotherapy. In 
these investigations control groups 
have been assembled from current 
cases for the particular research in 
progress, or control data have been 
obtained from the preshock records of 
the investigator's own practice or 
hospital. Table 7 presents a repre- 
sentative group of studies containing 
treated groups as well as untreated 
control groups. 


Graphic Analysis of Previously 
Tabulated Results on Somato- 
therapies 


In order to analyze the results of 
the previously tabulated studies, 
specifically for schizophrenia, a series 
of graphs was made. Figure 2 shows the 
percentage of schizophrenics recovered, 
much improved, and improved 
at follow-up intervals up to 5 years 
after admission. The findings of 
individual studies and control studies 
have been shown in the graph, but 
the lines are drawn through the 
averages for nonspecific treatment at 
each follow-up interval in the one case, 
and through the averages for the 
controls at each follow-up interval in 
the other. One exception should be 
pointed out. It will be noted that the 
nonspecific treatment study at the 
two-year level (a single study, with 
23.57% recovered and improved) 
drops unusually low and out of line. 
The point is indicated in the graph, 
but omitted from the line of averages. 
In general the trend is for the 
nonspecific treatment groups to show a 
recovered, much improved, and 
improved rate of about 40% at the end 
of five years, a finding in conformity 
with those of Bond and Fuller. It is 
interesting to note, however, that 
with the single exception of the 
one-year follow-up level, the controls 
are always poorer than the nonspecific 
groups, showing a recovered, much 
improved, and improved rate of only 
about 25% at the end of five years. 
One may ask why the controls used 
in the shock era should be so 
different. One reason may be that in the 
shock era only poorer patients, that 
is, those with unfavorable prognoses, 
were available as controls, other 
patients being given the benefit of the 
specific treatments. There is also a 
second possibility, namely, that 
chronic, deteriorated cases may have 
been selected as controls. In any case 
the outcome for the controls is 
strikingly different from that for those 
given nonspecific treatment in the 
1930's. The general findings of these 
studies of nonspecific treatments and 
spontaneous remissions should be 
kept in mind as we turn now to the 
results that have been reported for 
the specific somatotherapies: insulin, 
metrazol, electroconvulsive therapy, 
and psychosurgery. 


[Figure 3: line graph; legend entries legible in the source: Average of ECT 
Studies, Average of Metrazol Studies, Average of Psychosurgery Studies.] 

FIG. 3. OUTCOME OF MENTAL PATIENTS TREATED WITH VARIOUS 
SOMATOTHERAPIES (SCHIZOPHRENICS) 


FIG. 4. OUTCOME OF MENTAL PATIENTS TREATED WITH VARIOUS SOMATOTHERAPIES AS 
COMPARED WITH OUTCOME OF MENTAL PATIENTS GIVEN NONSPECIFIC TREATMENT 
(PRESHOCK) AND THOSE USED AS CONTROLS (SCHIZOPHRENICS) 


Next, the outcome of schizophrenics 
treated by the various somatothera- 
pies, that is, insulin, metrazol, elec- 
troshock, and psychosurgery, was an- 
alyzed. These results are presented 
in Fig. 3. The percentage of schizo- 
phrenics recovered, much improved, 
and improved in the individual 
studies at varying follow-up intervals 
after the termination of treatment, 
up to five years, is shown in this 
graph. As in the previous graph, the 
averages at each interval of follow-up 
have been computed. The insulin 
average is indicated as I, electro- 
shock as E, metrazol as M, and psy- 
chosurgery as P. The averages for 
each treatment at each interval con- 
stitute the points through which the 
lines are drawn to represent the out- 
come for each type of treatment. The 
graph reveals considerable variability 
and overlap among the different types 
of somatotherapies as well as a dearth 
of studies for longer follow-up pe- 
riods. Most studies on the outcome 
of therapy report only immediate 
outcome and very few go beyond the 
one-year period. At five years all 
outcome results for the somatother- 
apies are poorer than immediate out- 
come. 
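
The averaging procedure described above can be sketched in a few lines of code. What follows is only an illustration under stated assumptions: the records are hypothetical values invented for the example and are not taken from Table 6 or from Fig. 3, the function names are our own, and a follow-up interval of 0 is assumed to stand for immediate outcome. Each study contributes one percentage (recovered, much improved, and improved) for a given therapy at a given follow-up interval; the percentages are averaged within each therapy and interval, and the averages for one therapy, ordered by interval, are the points through which its line would be drawn.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (therapy code, follow-up in years, % recovered/improved).
# These values are invented for illustration; they are NOT data from the paper.
# Codes follow the text: I = insulin, E = electroshock, M = metrazol, P = psychosurgery.
studies = [
    ("I", 0, 60.0), ("I", 0, 50.0), ("I", 5, 40.0),
    ("E", 0, 55.0), ("E", 1, 45.0), ("E", 1, 35.0),
    ("M", 0, 50.0),
    ("P", 1, 30.0),
]

def average_by_interval(records):
    """Group results by (therapy, follow-up interval) and average each group."""
    groups = defaultdict(list)
    for therapy, years, pct in records:
        groups[(therapy, years)].append(pct)
    return {key: mean(values) for key, values in groups.items()}

def line_points(averages, therapy):
    """Return the (interval, average) points for one therapy, ordered by interval."""
    return sorted(
        (years, avg) for (code, years), avg in averages.items() if code == therapy
    )

averages = average_by_interval(studies)
for code in "IEMP":
    print(code, line_points(averages, code))
```

An unweighted mean treats a study of 20 patients and a study of 500 patients alike; the text does not say whether the plotted averages were weighted by the number of patients, so the sketch makes no such adjustment.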

In Fig. 4 the lines shown in Fig. 2 
and 3 have been combined. Figure 4 
shows the average outcome results 
for schizophrenics treated by each 
somatotherapy, for the nonspecific 
treatment and for the controls. The 
average results for all insulin studies 
are plotted as I at each interval, 
metrazol as M, electroshock as E, 
psychosurgery as P, nonspecific as N, 
and the controls as C. The lines 
drawn through these average points 
at the various follow-up intervals in- 
dicate in all cases better immediate 
outcome for treated patients (about 
50-60%) than for the nonspecific and 
controls, but this advantage is not 
maintained for the treated group on 
follow-up after five years. Whereas 
the nonspecific group never shows 
such striking recovery and improve- 
ment rates, the treated groups show 
more relapses with time, dropping 
toward the nonspecific rate of about 
40% after 5 years following treat- 
ment, but never dropping as low as 
the controls. 

When all the treatments are averaged 
into a single line and when the 
nonspecific and controls are included, 
we get the results shown in Fig. 5. 
For comparison purposes we have 
also drawn in the nonspecific and 
control averages. In the graph the 
letters I, P, E, M, N, and C indicate the 
average results for all studies of 
insulin, psychosurgery, electroshock, 
metrazol, nonspecific treatment, and 
controls at a given follow-up period (for 
the somatotherapies from the date of 
termination of treatment, and for the 
nonspecific and controls from the day 
of admission). It will be immediately 
observed that the recovered, much 
improved, and improved rate among 
treated schizophrenics tends to decline 
as the period after treatment 
increases. For the untreated the course 
is more variable, but when the 
nonspecific treatments are considered 
alone, they are slightly better than 
the treated after 5 years, with a much 
more even course than the treated 
during the five-year period. 

FIG. 5. TREATED VS. UNTREATED PATIENTS (SCHIZOPHRENICS) 

Although from these results it 
would appear that somatotherapy 
for schizophrenics offers little 
advantage when long-term follow-up 
results are considered, another aspect 
of the problem needs to be considered, 
that is, the incidence of deaths among 
the treated and the untreated. Figure 6 
presents the percentage of 
schizophrenics dying in the untreated 
and the treated groups during follow-up 
periods, in the case of the 
treated from the date of the 
termination of treatment, and in the 
case of the untreated (nonspecific and 
control patients) from the day of 
admission. The average deaths on 
follow-up for patients who had been 
treated with insulin, metrazol, 
psychosurgery, and electroshock are 
shown in Fig. 6. The lines representing 
average nonspecific and control deaths 
are also shown. In this graph we can 
see immediately that the death rate 
rises steadily in the untreated groups 
(nonspecific and controls). Fewer 
patients die in the treated than in the 
untreated groups. (It should be noted 
too that psychosurgery contributes 
considerably more to the death rate 
than the other therapies.) The saving 
of life which treatment offers cannot 
be overlooked, even in the face of 
failure to produce high recovery and 
improvement rates.* 

FIG. 6. INCIDENCE OF DEATHS AMONG SCHIZOPHRENIC PATIENTS 

The data presented in these graphs 
do not permit us to draw any definite 
conclusions as to the relative merits 
of any one specific therapy in the 
treatment of schizophrenia, since the 
evidence is scattered and since, 
moreover, not all studies on 
somatotherapy which we have found in the 
literature could be graphed in this way 
(largely because of insufficient 
information as to follow-up, etc.); 
nevertheless, they shed some light on 
the inadequacies of the studies that 
have been done and they indicate the 
need for improvement of the research 
in the evaluation of the 
somatotherapies (33). 

* The conclusion that the death rate among 
the treated is lower than that among the 
controls is only tentative, since it was 
impossible to obtain information regarding 
age-specific and sex-specific death rates for 
the treated and the control groups. 


AN APPRAISAL OF THE FOREGOING 
STUDIES IN TERMS OF 
METHODOLOGY 


The methodology of the studies 
listed in Tables 6 and 7 will be con- 
sidered in terms of the four essentials 
previously set by one of the present 
authors (84) as minimum essentials of 
adequate research design: homo- 
geneity, control groups, follow-up, 
and specific criteria for evaluating 
outcome. 


Homogeneity 


It can be seen from both Tables 6 
and 7 that some investigators con- 
tinue to include all types of mental 
diseases in their investigations with- 
out specification. While this practice 
is acceptable if separate outcome re- 
sults are presented, some give only 
the total results (44, 58). Very fre- 
quently acute and chronic cases are 
included in the same study, and total, 
rather than separate, outcome results 
are reported. In most cases the length 
of the illness before treatment is not 
even mentioned. While most investi- 
gators usually give adequate identify- 
ing data as to the age, sex, and num- 
ber of patients, an occasional study 
does not even clearly specify the dis- 
ease entity under investigation. From 
the point of view of research design in 
general, however, homogeneity is 
probably the feature least open to
criticism in modern evaluations of
the somatotherapies. 


The Use of Control Groups 


This criterion of research design is 
often not met at all or only very 
poorly satisfied (15). In Table 6
there are a large number of studies, 
no one of which has used any con- 
trols. This type of study is commonly 
found in the literature. The immedi- 
ate or follow-up results of treatment 
are presented, the basic assumption 
being that the patients studied would 
have been worse if untreated. In 
many of the studies which have em- 
ployed controls (see Table 7), the dif- 
ficulty lies in the nature of the con- 
trols used. In some investigations 
old studies such as Bond's have been 
cited and used as norms against which 
to compare shock results. These data 
ought not to be used as standards or 
base lines since, as we have seen, they 
were derived from total hospital pop- 
ulations. Moreover, diagnostic cri- 
teria have changed and probably the 
character of the patient population 
has changed, too, as a consequence of 
mental health education, interest in 
psychiatry, and increased use of the 
specific therapies in private practice 
as well as in hospitals. The use of the 
control data of other workers and 
other hospitals as reported in the 
literature, no matter how recent, is 
valueless because of the discrepancy 
in diagnostic criteria. Yet this prac- 
tice has not been abandoned by in- 
vestigators (29). 

Several investigators, appreciating 
the inadvisability of employing con- 
trol data from other studies, have 
assembled controls for their particu- 
lar research. But a review of Table 7 
will reveal that this practice is not 
without its difficulties. Often such 
control groups include patients in 
whom the treatment is contraindicated
(82, 83), whose symptoms are
too mild for treatment by the particu- 
lar therapy under investigation (82, 
83), or who are chronic and deteri- 
orated cases. Some control groups 
include an assortment of all these 
different types (82, 83). Occasion- 
ally, the nature of the control group 
is merely described as a comparable 
group but without shock treatment. 
No details are given (79). At times, 
control cases are selected at random 
by a secretary (47). Very often the 
control groups are not given the 
same amount of motivation as the 
treated groups, so that their morale 
may be lower during the observation 
period. On that account the groups 
may not be strictly comparable. Some- 
times the control groups or some of 
their members are given other treat- 
ment during the period of follow-up 
(23). 

Needless to say these practices 
render the findings of such research 
worthless. The problem of establish- 
ing controls in psychiatric research 
has provoked much discussion in the 
literature, some writers taking points 
of view that are diametrically opposed.
Curran (14) has indicated
that obtaining controls in psychiatry
is an arduous task, for it is not easy
to assemble groups of patients who
can be validly compared. In discussing
electric shock treatment and its
results, Reynolds (67) has taken a
similar view. Moreover, Curran
argues that it is never possible to get 
untreated groups, for he feels that it 
is not possible to determine the na- 
ture of a reaction without altering 
that reaction at least to some extent, 
since in any medical examination 
some impressions as well as some 
recommendations are made. 

The selection of controls is no easy 
matter. First of all, our ignorance of 
the causes of mental diseases militates
against matching controls and
treated cases with certainty. Sec- 
ond, it is not right, ethically speak- 
ing, to withhold certain treatments, 
just as it would be wrong to admin- 
ister certain untried treatments, for 
research purposes only. One feature 
which has helped the research worker 
is the failure of some families to con- 
sent to a specific therapy for which a 
particular patient has been selected. 
Unfortunately, the attitude of the 
family may be a factor in the final re- 
lease of the patient, thus producing a 
discrepancy between the treated and 
untreated groups (22). 

Occasionally the objection is raised 
that the morale of the control groups 
is lowered by failure to receive the 
specific therapies. This objection has 
been met by Notkin (61, 62), for ex- 
ample, who gave intramuscular injec- 
tions to control group patients while 
the treated patients received insulin. 
This objection may also be ade- 
quately met by providing the control 
group with “total push.” In this connection,
it is interesting to note that
Mettler (57) has urged that three 
comparable groups be used, if possi- 
ble: (a) one group should be followed 
to investigate the degree of spontane- 
ous improvement which may occur; 
(b) another should receive the specific 
therapy to be evaluated; (c) if a third 
group exists, it should be subjected to 
all the nonspecific aspects of the ther- 
apy to be evaluated. Theoretically 
Mettler's suggestion is sound, but in 
practice it further complicates the 
problem by requiring the selection of 
three comparable groups instead of 
two. Since the use of controls is im- 
perative in any sound experimenta- 
tion, this important criterion contin- 
ues to be one of the troublesome fea- 
tures of measuring the effectiveness 
of the somatotherapies. 

One method is to lay down a minimum
number of variables on which
the treated and the untreated groups 
should be comparable. Among such 
fundamental variables are: age, sex, 
age at onset of illness, type of onset, 
type of illness, type of treatment, 
duration of follow-up. Once com- 
parability is attained on these funda- 
mental variables, additional vari- 
ables on which the control and 
treated groups may differ can be con- 
trolled by analysis of covariance or 
similar methods. A list of variables 
which may be important in determin- 
ing outcome is currently under study 
in an investigation of prognosis (Zu- 
bin, J., Peretz, D. and Ossipow, S., 
Psychiatric Prognosis—in prepara- 
tion). 
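
The covariance adjustment mentioned above can be illustrated with a minimal sketch. It assumes a single numeric covariate (age) and a numeric outcome score, neither of which is specified by the studies reviewed; the function name and all figures are hypothetical. It shows only how a raw difference between treated and control groups changes once the groups are put on an equal footing with respect to the covariate.

```python
from statistics import mean


def ancova_adjusted_difference(treated, control):
    """Return (raw, adjusted) treated-minus-control differences in mean outcome.

    Each argument is a list of (covariate, outcome) pairs for one group.
    The adjustment uses the pooled within-group regression slope, as in a
    one-covariate analysis of covariance.
    """
    def group_sums(pairs):
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        return x_bar, y_bar, sxx, sxy

    xt, yt, sxx_t, sxy_t = group_sums(treated)
    xc, yc, sxx_c, sxy_c = group_sums(control)
    pooled_slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    raw = yt - yc
    adjusted = raw - pooled_slope * (xt - xc)
    return raw, adjusted


# Hypothetical data: (age, outcome score). The treated group is younger
# on average, which masks part of the difference in the raw comparison.
treated_group = [(20, 5.5), (30, 6.5), (40, 7.5), (50, 8.5)]
control_group = [(25, 4.5), (35, 5.5), (45, 6.5), (55, 7.5)]

raw_diff, adjusted_diff = ancova_adjusted_difference(treated_group, control_group)
print(round(raw_diff, 3), round(adjusted_diff, 3))
```

In this invented example the groups differ by five years in mean age, and the adjusted difference is larger than the raw one. With real data the adjustment could as easily shrink the apparent advantage of treatment.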


Follow-up 


In both the uncontrolled and the 
controlled studies, the duration of 
follow-up is frequently unspecified. 
In some studies it varies from patient 
to patient. The duration of follow-up 
is a prime consideration, for immedi- 
ate evaluations of the outcome of 
therapy or evaluations based upon 
short follow-up periods make no al- 
lowance for the possibility of later 
relapses. Thus the recovery rates in 
studies that give the patients’ status 
immediately or shortly after termina- 
tion of treatment are spuriously high. 
Menninger (56) and Alexander (2) 
have for this reason both emphasized 
the importance of the time factor in 
evaluation studies by stressing how 
different therapeutic results may ap- 
pear after varying intervals following 
treatment. In general, long-term fol- 
low-up, preferably a period of five 
years, is the ideal. Only a few studies 
meet this requirement (18, 34, 39). 
It is of course difficult and expensive 
to keep each member of a population 
under observation for so long a period.
Furthermore, populations are
not static. There are remissions,
deaths, and patients lost to the study 
because of removal from the com- 
munity. Dorn (16) has cautioned 
that the validity of a follow-up study 
is very questionable if each individual 
is not followed for the maximum pos- 
sible duration. Naturally it is under- 
standable that all patients cannot 
always be followed. The investigator 
should mention how many could be 
traced after a given follow-up period, 
but this information is seldom sup- 
plied. More commonly, conclusions 
are based on the number for whom 
information is available with no refer- 
ence to missing cases. Obviously this 
gives rise to biased results. 
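
The bias that arises from ignoring missing cases can be made concrete with a small sketch. The figures and the function name are hypothetical. The calculation simply brackets the true recovery rate between two extremes: every untraced patient unrecovered, and every untraced patient recovered.

```python
def recovery_rate_bounds(recovered, traced, treated_total):
    """Return (observed, lower, upper) recovery rates.

    observed: rate among traced patients only (the figure usually reported)
    lower:    rate if every untraced patient failed to recover
    upper:    rate if every untraced patient recovered
    """
    if not 0 <= recovered <= traced <= treated_total or traced == 0:
        raise ValueError("need 0 <= recovered <= traced <= treated_total, traced > 0")
    lost = treated_total - traced
    observed = recovered / traced
    lower = recovered / treated_total
    upper = (recovered + lost) / treated_total
    return observed, lower, upper


# Hypothetical study: 100 patients treated, 60 traced at follow-up,
# 30 of those traced found to be recovered.
observed, lower, upper = recovery_rate_bounds(recovered=30, traced=60, treated_total=100)
print(f"observed {observed:.0%}, possible range {lower:.0%} to {upper:.0%}")
```

A report that gives only the observed rate hides the width of this range, which is why the number of patients traced at each follow-up period needs to be stated.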

While we have not indicated in our 
tables what method of follow-up was 
employed in each study, it should be 
mentioned here that in a large num- 
ber of studies the procedure is rather 
haphazard. Rarely is it uniform for 
all patients. As a rule, questionnaires 
are sent to patients or to their fam- 
ilies. Social service or psychiatric 
interviews are given to some, but not 
to all. Occasionally in the same study 
some patients are followed up in 
every possible way, some in only one 
way, failing which, the patient is lost 
to the study. The variation in follow- 
up methods for subjects in the same 
investigation is considerable. 

In general the interview is consid- 
ered preferable to the questionnaire 
by all workers. Ideally, for re-exam- 
ination, the psychiatrist as well as the 
social worker should see the patient, 
preferably the same psychiatrist and 
social worker responsible for original 
examination. Patients’ reports of 
their own status or those obtained 
from their families, important as they 
are, should not constitute the sole 
basis for evaluation. 

Another problem of follow-up which 
is critical is the treatment of deaths.
Some workers such as Karagulla (43)
exclude deaths; others such as Slater 
(72) include them in computing the 
improvement rate. In several studies 
deaths are not even reported. It has 
been suggested (85) that Jerzy Neyman’s
(21) mathematical models
can provide the real answer to this 
problem through the computation of 
net improvement rates from which 
the influence of deaths and relapses 
has been eliminated. 
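
How much the handling of deaths alone can shift a reported figure is shown in the following sketch. It is not Neyman's model, which is not described here; it merely contrasts the two crude conventions mentioned above, using hypothetical counts and a hypothetical function name.

```python
def improvement_rates(recovered, improved, unimproved, dead):
    """Improvement rate under two conventions for handling deaths.

    deaths_excluded: deaths dropped from the denominator
    deaths_included: deaths counted in the denominator as unfavorable outcomes
    """
    favorable = recovered + improved
    survivors = favorable + unimproved
    if survivors == 0:
        raise ValueError("at least one surviving patient is required")
    return {
        "deaths_excluded": favorable / survivors,
        "deaths_included": favorable / (survivors + dead),
    }


# Hypothetical follow-up tabulation of 100 patients.
rates = improvement_rates(recovered=20, improved=20, unimproved=40, dead=20)
print(rates)
```

With these invented counts the same data yield an improvement rate of one half or two fifths, depending solely on the convention. Two studies that differ in this respect cannot be compared directly unless the number of deaths is reported.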


Criteria of Evaluation 


In reviewing the studies on the outcome
of therapy one cannot help being
impressed with the number of outcome
categories which research workers
have been able to devise. Recovery
may be variously expressed as complete
recovery or recovery,
while improvement may be described
in terms of improved, much im- 
proved, slightly improved, little im- 
provement, and the like. In our tabu- 
lations we have grouped these various 
headings under the more general cap- 
tions, Recovered and Improved, Un- 
improved and Dead. Not only do dif- 
ferent investigators use different cate- 
gories, but they define the same cate- 
gories differently. Objective terms 
are needed to express changes in the 
patient’s condition after treatment. 
Objectivity, however, is difficult to 
achieve wherever the nature and ex- 
tent of improvement are obscure, as 
is usually the case in psychiatric dis- 
orders. One must determine whether 
a specific change in a patient's condi- 
tion signified improvement. The ex- 
tent of error in judging the presence 
of improvement and its degree should 
Gjessing (71) in 1938,
known what a given worker means by
“much improved,” urged that international
standards be defined for
these terms. One can only regret, 
after a review of the current litera- 
ture, that Gjessing’s suggestion was 
never implemented. 
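
The grouping of varied outcome labels under a few general captions can be sketched as a simple lookup. The labels are those quoted above. Assigning “little improvement” to Improved is an assumption made for illustration, as is the catch-all caption for labels not in the table; the names used in the code are hypothetical.

```python
from collections import Counter

# Assumed assignment of reported labels to the general captions.
GENERAL_CAPTIONS = {
    "complete recovery": "Recovered",
    "recovery": "Recovered",
    "much improved": "Improved",
    "improved": "Improved",
    "slightly improved": "Improved",
    "little improvement": "Improved",  # assumption; could be argued either way
    "unimproved": "Unimproved",
    "dead": "Dead",
}


def tabulate(labels):
    """Count patients under each general caption.

    Labels missing from the lookup are counted as 'Unclassified' rather
    than being silently dropped.
    """
    counts = Counter()
    for label in labels:
        key = label.strip().lower()
        counts[GENERAL_CAPTIONS.get(key, "Unclassified")] += 1
    return dict(counts)


reported = ["Much improved", "recovery", "Little improvement", "dead", "worse"]
print(tabulate(reported))
```

Any such table embodies judgments about borderline labels, which is the difficulty that agreed standard definitions would remove.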

Because of the lack of uniformity 
and objectivity in reporting outcome, 
exact comparison of the various 
studies is difficult. It appears that 
the only uniform category for all 
studies is the Dead, although in some 
studies the number of the dead is not 
listed separately, being included under 
other categories according to the 
status at the time of death. Where 
deaths are reported, the figures are 
often presented apologetically, at 
times with reassuring observations to 
the effect that death was not really 
due to the treatment. 

In reporting outcome, workers 
have demonstrated considerable vari- 
ation in the use of numbers or per- 
centages of patients. Some authors 
use parole or discharge as the crite- 
rion of evaluation. In spite of its limi- 
tations, as previously mentioned, this 
criterion does seem to be about the 
most satisfactory, since it reduces the 
classification to a dichotomy. The 
patient is either in or out. Where
there are multiple classifications, 
they are invariably based upon sub- 
jective clinical evaluations with con- 
sequent increase in the possibility of 
error. More recently, as in the Brain 
Research Project at the New York 
State Psychiatric Institute, objective 
rating scales have been employed 
with a scale of numbers ranging from
1 to 5 (excellent . . . most unfavorable
outcome). This lends itself more
readily to statistical evaluation with- 
out in fact eliminating the subjective 
character of the rating. Such a rating 
scale, universally used, would make 
for uniformity of outcome categories, 
and by reducing the present variation
from study to study, would facilitate
comparisons of various investigations. 


NEED FOR PLANNED RESEARCH 


Various suggestions for planned 
research (73) have been formulated 
from time to time by different inves- 
tigators. Thus Luff (49) urged close 
cooperation between workers in dif- 
ferent mental hospitals, pointing to 
the kind of cooperative inquiry em- 
ployed successfully in cancer research 
as an example of what could be done. 
In 1937 Luff suggested that mental 
hospitals keep standard records and 
institute follow-up systems such as 
are maintained for cancer study. 
Systematic research seems to be just 
as necessary in the field of mental dis- 
ease as for physical disease. After re- 
viewing the literature for the present 
paper, however, the writers are left 
with the impression that a large per- 
centage of the articles on the evalua- 
tion of the somatotherapies was in- 
spired by the mere fact that certain 
data had been collected. An article in 
a psychiatric journal seemed a nat- 
ural way to make use of them. In 
other words, the planning often seems 
to have come last. This is probably 
the reason why a large number of in- 
vestigations reported in the literature 
on evaluation of therapy appear pur- 
poseless, disorganized, and poorly ex- 
ecuted, despite their impressive ar- 
rays of statistics. Various complaints 
have been raised against these sur- 
veys of therapeutic results. Some 
critics, like Israel and Johnson (41), 
have felt that prevailing statistics do 
not accurately portray the really 
hopeful prognosis for the mentally ill. 
Others have complained that present 
statistics are too optimistic. 

As early as 1930, Sakel (71), 
alarmed by the already numerous 
statistical reports on shock therapies, 
cautioned that psychiatrists should
not place too much reliance on statistics
because of the lack of knowledge
as to the true nature of the mental
diseases which they were studying,
especially schizophrenia. In his
opinion most of the researchers were
dealing with symptoms, for “they
had no test to define the nature of
schizophrenia, and therefore they
could not set up a test of ‘cured’ and
‘not cured’.” In like vein, Alexander
(1) has complained about “the delirium
of numbers” in personality research.
Similarly, Lewis (46) in discussing
the status of shock therapy in
1943 emphasized the disagreement in 
results and reminded his colleagues 
that statistical manipulation of un- 
reliable data is fruitless. Moreover, 
there seems to be a clear division be- 
tween one group of workers who rely 
on clinical judgment and another 
group who enlist statistics in apprais- 
ing the results of the psychiatric ther- 
apies. In any case, research evaluat- 
ing the outcome of therapy needs 
careful planning. 


SUMMARY 


This review of the literature evalu- 
ating the somatotherapies reveals 
that a large number of studies have 
been inadequately planned and poorly 
designed. Several serious defects
have been observed. Among them
are: (a) lack of homogeneity of patients
studied in respect to diagnostic classification,
age, duration of illness, and
follow-up; (b) too brief, poorly executed,
or inadequately reported follow-up;
(c) lack of controls or poorly
selected controls; (d) inadequate, ill-defined,
or unspecified criteria for
evaluating outcome; (e) failure to report
deaths, especially for follow-up
studies, or inclusion of the dead under
the category representing their status
at the time of death.

In spite of their individual limitations
these studies, taken in the aggregate,
have demonstrated short-term
advantages but have not demonstrated
definitely significant advantages
for the specific somatotherapies
in the long run. Our review of
the literature, however, has revealed 
the following facts: 

1. Where only immediate outcome 
is reported, there seems to be a dis- 
tinct advantage for treated groups as 
compared with untreated ones. Their 
stay in the hospital is reduced. The 
death rate is apparently lower for the 
treated. Such results should not be 
underestimated, for they may mean 
that suicides and deaths from inani- 
tion have been reduced among de- 
pressed patients; that human suffer- 
ing has been alleviated and that pri- 
vate and state funds for hospital care 
have been saved even though relapses 
do occur. The somatotherapies help 
to save human life. 

2. Long-term follow-up studies have 
not generally shown better results for 
the treated over the untreated in 
terms of recovery and improvement. 
The recovery rate still hovers around 
35% to 40%. More patients tend to 
recover in nonspecific groups (and 
even in the control groups) after five 
years, whereas more relapses occur 
among the treated patients. 

3. Generally speaking, the specific

APPENDIX
KEY TO TABLES

Types of Patients

Alcoholics
Affective psychoses MD-d; MD-m; IM
All other psychoses
All psychoses
Carcinoma
Psychosis with cerebral arteriosclerosis
Psychosis with cerebral syphilis
Drug psychosis
Dementia praecox
Depressed states
Epilepsy
Encephalitis
General paresis
General paralysis
Involutional melancholia
Involutional melancholia, melancholic
Involutional melancholia, paranoid
Mixed psychoses
Manic depressive, agitated depression
Manic depressive psychosis
Manic depressive, depressed
Mental deficiency

MD-mix    Manic depressive, mixed
MD-m      Manic depressive, manic
Mo-h      Morphia hallucinosis
N         Neurosis
O         Other
Ob        Obsessive states
OBD       Organic brain damage
OBND      Other brain or nervous diseases
Pa        Paranoid
PN        Psychoneurosis
P.inf.    Psychosis with infection
P.pell.   Psychosis with pellagra
P.som.    Psychosis with somatic disease
PsPath    Psychopath
RD        Reactive depression
SA        Senile arteriosclerosis
Sx        Schizophrenia
SP        Senile psychosis
Uy        Unclassified
Ud        Undiagnosed
Uns       Unspecified
V         Varied

Duration of Illness

Acute              °  Average
Chronic               About
Subacute           <  Less than
Unspecified        >  More than
Varying

Duration of Follow-up

Immediate              °  Average
Unspecified            .  About
Varying                   Less than
After admission           More than
After first admission
After discharge

Types of Somatotherapy

Cardiazol and electroconvulsive therapy
Insulin
Insulin with few comas
Me     Metrazol
ECT    Electroconvulsive
4      Lobotomy
PNa    Prolonged Narcosis

Results

Dead                    (a)  Paroled successfully
Improved                (b)  Complete & social recovery
Much improved           (c)  Improved & slightly improved
Recovered               (d)  Full recovery
Unimproved
Otherwise discharged

It is the purpose of this paper to
present a theoretical approach to perception
and thought which, although
by no means entirely new, will undoubtedly
seem strange and unorthodox
to many. The term “microgenesis,”
first coined by Werner
(132) as an approximate translation
of the German word Aktualgenese,
will refer here to the sequence of
events which are assumed to occur in
the temporal period between the
presentation of a stimulus and the
formation of a single, relatively
stabilized cognitive response (percept
or thought) to this stimulus. More
specifically, the term will refer primarily
to the prestages of extremely
brief cognitive acts, e.g., the processes
involved in immediately perceiving a
simple visual or auditory stimulus,
conceptually generating a word association,
etc. Thus, cognitive sequences
involving many seconds or
minutes, such as perceptual changes
resulting from prolonged fixation,
will not be considered here as examples,
or at least as typical examples,
of microgenetic development. Within
this somewhat restricted conception
of microgenesis or microdevelopment,
one can distinguish, in terms of experimental
operations, between
“microgenesis of thought” and
“microgenesis of perception.” In the former
case we refer to situations in which
little attention is given to the conditions
of stimulus or task presentation
but careful attention is paid to the
temporal development of the conceptual
response. In the latter case,
we refer to conditions in which considerable
attention is paid to the
manner of stimulus presentation but
little, if any, is paid to the temporal
evolution of the ensuing verbal response.
The experimental paradigm
of microgenesis of thought consists of
presentation of a stimulus to cognition,
under optimal conditions of
perceptual “intake,” and some sort of
attempt to study, or even control, the
evolution of the cognitive response to
this stimulus. The paradigm of microgenesis
of perception, on the other
hand, usually entails the successive
presentation of a stimulus under conditions
of increasing clarity. Successive
tachistoscopic presentation of
visual stimuli with exposure times
gradually increasing until complete
perception is possible, considered as
the experimental homologue of the
everyday, near-instantaneous process
of simply “seeing” an object, would
perhaps be the best example of this
paradigm. It should be mentioned
that the distinction made here stems
from a distinction between typical
experimental conditions and does not
imply a particular brief for or against
any basic dichotomy between perception
and thought.

In attempting to conceptualize
processes within a microgenetic
framework, at least two basic questions
arise. From the evidence available,
what formal principles of cognitive
microdevelopment have been
or could be derived to constitute a
first beginning of a microgenetic
theory? Of what use would such a
theory be in organizing known facts
of normal and abnormal perception
and thought and in constructing
testable hypotheses for future research?
It is hoped that this paper
will suggest partial answers to these
questions. We shall first survey some
of the theoretical and experimental
work which seems to us to bear upon
the first of the two questions. Following
this, some tentative notions will
be proposed with regard to the second
question.


MICROGENESIS OF PERCEPTION 


There is a fairly sizable body of 
literature concerned, in one way or 
another, with the temporal evolution 
of percepts. A good half of these 
studies emanate directly from one 
microgenetically oriented “school”
and, in this sense, form a tightly knit
whole. The remainder of the investigations
differ widely among themselves
as to theoretical orientation,
experimental procedure, etc. In this
section we will first describe the contributions
of the former group of
studies and then compare their findings
with those of the miscellaneous
remaining experiments.

In the early twenties, there arose in 
Germany a movement against post- 
Wundtian elementaristic psychology 
led by Felix Krueger of Leipzig. Like 
the better-known Berlin group, 
Krueger and his followers were Ge- 
staltists and stressed the intrinsic 
structuredness of perception. Unlike 
the Berlin school, however, they were 
particularly concerned with the tem- 
poral development of percepts as well 
as with the formal properties of com- 
pleted percepts. Krueger developed 
a complicated and somewhat esoteric 
general theory which is of only 
tangential relevance to microgenesis 
(59). His co-worker Sander, how- 
ever, did develop an explicitly micro- 
genetic theory of perception within 
Krueger's framework (94, 95) and, 
with his students, carried out a 
variety of experimental studies on the
problem. He believed that perception
is a developmental process consisting
of a number of conceptually
distinct phases. Further, he assumed 
that percepts obtained under inade- 
quate stimulus conditions, e.g., brief 
tachistoscopic exposure, are essen- 
tially the same as the initial, transi- 
tory percepts which precede the final 
perceptual response under normal 
stimulus conditions. He granted that 
the precursors of the final percept are
not observable in the normal, per- 
ceptual process. However, he argued 
that if one experimentally blocks the 
formation of clear, complete percepts 
by presenting stimuli very briefly, in 
bad lighting, in peripheral vision, etc., 
one can elicit these perceptual pre- 
cursors. On the basis of experimental 
findings, Sander was able to offer a 
fairly detailed description of perceptual
microgenesis or Aktualgenese,
as he called it. Our account of the 
process will follow that of Undeutsch 
(112), one of Sander’s students. 

When a perceptual stimulus is pre- 
sented under conditions of gradually 
increasing clarity, the initial percep- 
tion is that of a diffuse, undifferenti- 
ated whole. In the next stage figure 
and ground achieve some measure of 
differentiation, although the inner 
contents of the stimulus remain 
vague and amorphous. Then comes a 
phase in which contour and inner
content achieve some distinctness
and a tentative, labile configuration
results. Finally, the process of
Gestalt formation becomes complete
with the addition of elaborations and
modifications of the “skeletal Gestalt”
(Gestaltgerüst) achieved in the
previous stage.

As development proceeds, external,
objective characteristics more and
more supplant inner, personal factors
as determinants of the structure perceived.
As Undeutsch puts it, the
balance of endogenous to exogenous
determinants changes as perceptual
microdevelopment proceeds. Of particular
interest to Sander and his
students was the stage just preceding 
the formation of the final, stable per- 
cept. In this Vorgestalt or precon- 
figuration phase the S has con- 
structed a tentative, highly labile 
Gestalt which is more undifferenti- 
ated internally, more regular, and 
more simple in form and content than 
is the final form which is to follow it. 
The construction of this initial, flux-like
pre-Gestalt is said to be accompanied
by decidedly unpleasant feelings
of tension and unrest which later
subside when a final, stable configuration
is achieved. The emotionally charged
character of the Vorgestalt
stage is stressed by many investiga- 
tors (38, 40, 65, 96, 107, 116, 136) 
whose reports are often supplemented 
by colorful and dramatic verbal re- 
ports by the Ss. 


These then were the near-unani- 
mous conclusions of Sander and his 
students with respect to the micro- 
genesis of percepts. Under what ex- 
perimental situations were these find- 
ings obtained? The Sander group 
showed no lack of imagination in 
their efforts to study Aktualgenese 
under all possible conditions. Some 
investigators presented stimuli under 
gradually increasing tachistoscopic 
exposure time. Using this technique, 
paintings by famous artists (65), 
three-dimensional geometric figures 
(38), and groups of everyday objects 
(73) were presented to Ss and per- 
cepts elicited at each exposure time 
were recorded and analyzed. Sommer 
(107) varied this procedure by gradu- 
ally decreasing, rather than increas- 
ing exposure time and was able to 
show, in reverse order, the usual se- 
quence of developmental stages. 
Wohlfahrt (136) presented geometric
designs in extreme miniature at first
and gradually increased their size 
until Ss were able to see them clearly 
and without effort. Butzmann (9),
also using geometric designs, re- 
corded perceptual alterations as stim- 
uli were gradually moved from the 
extreme periphery of the visual field 
in towards a central fixation point. 
Other investigators used stimuli or 
arrangements of stimuli which were 
meaningless or disorganized and com- 
pared perceptual development under 
such conditions with that which oc- 
curred when meaningful, organized 
stimuli were used (23, 47, 48). Additional
investigations conducted by
the Sander group involve the microgenesis
of tactile impressions (40), the
temporal process of describing
clearly seen objects (116), and miscellaneous
other problems (93, 96).
Such was the variety of stimulus
conditions employed. The perceptual
responses on which the theory was
based were obtained in either of two
ways: (a) simple introspection, or
verbal report (40, 65, 96, 107); (b)
pictorial reproduction of what was
perceived (9, 23, 47, 48, 73, 136), supplemented
in one case by manual
arrangement of concrete stimulus objects
in attempted duplication of the
percept (40).

It is thus apparent that Sander and
his group made a vigorous and concerted
attack on what they saw as an
important problem in perception. In
reading through the variety of parallel
studies done outside of Germany
one is struck by the fact that the
great bulk of Aktualgenese research is
seldom cited. Similarly, references to
non-German experiments on perceptual
microgenesis are equally infrequent
in the work of the Krueger
school. This lack of cross fertilization,
however unfortunate in some
ways, does make it possible to compare
the experimental conclusions of
scientists who are not mutually 
tainted by each other's theoretical 
preconceptions. It is therefore in- 
teresting to note that Sander's as- 
sertion that microgenesis begins with 
diffuse, whole percepts which subse- 
quently become sharpened and in- 
ternally differentiated receives con- 
siderable confirmation from other 
studies. For example, experimenters 
using such different stimuli as geo- 
metric figures (6), letters of the 
alphabet (20), Rorschach (80, 109) or 
self-made (31) inkblots, Rubin figure- 
ground cards (134), and various kinds 
of pictures (12, 22, 103), have also 
reported developmental sequences in 
the general direction of diffuse to 
specific. Further, Brigden (6) found 
a tendency towards simplification, 
completion, transposition, and in- 
creased symmetry as development 
progressed—a finding quite congru- 
ent with Sander’s statement that 
percepts at the Vorgestalt stage tend 
to be made “better Gestalten” at the
expense of object similarity. Brigden 
also lists an early tendency to compli- 
cate the percept which the Sander 
group did not explicitly postulate. It 
is, however, possible that this compli- 
cation tendency is not unlike the 
microgenetically early overinvest- 
ment of meaning noted by Dyn (23), 
Hippius (40), and Johannes (47, 48). 
As to the microgenetically late trend 
from specific details to integrated
wholes, tachistoscopic studies using
Rorschach stimuli confirm this only
in part (80, 109). In these latter
studies a continuous, nonreversing 
trend from wholes to details is found, 
although those whole responses which 
are given in the end stages do tend to 
be of the integrated, internally dif- 
ferentiated rather than global type. 
On the debit side, the intense emotionality
which the Sander group reports
as an invariable concomitant of
Vorgestalt formation is certainly not 
stressed by most other investigators, 
although Douglas (22) makes explicit 
mention of it. There are other minor 
disagreements between the findings 
of the Aktualgenese group and those 
of other investigators. Since the ex- 
perimental methods used in the 
studies to be compared are often only 
roughly equivalent, it is difficult to 
interpret the meaning of such dis- 
agreements with any confidence. 

Before concluding our account of 
experimental studies of perceptual 
microdevelopment it must be men- 
tioned that many of these studies 
would be considered quite poor by 
present-day methodological stand- 
ards. This is especially, although not 
exclusively, true of the research done 
by Sander and his school. Few Ss 
were used and these were seldom ex- 
perimentally naive, statistics were 
inadequate or absent, and methods of 
measuring and evaluating perceptual 
responses were informal to say the 
least. In addition, serious questions 
concerning basic assumptions can be 
posed, as will be seen later. Nonethe- 
less, the existing German and non- 
German studies together constitute a 
rather extensive and exciting first 
assault on the truly fundamental 
problem of how our percepts get 
formed. As will soon be apparent, 
there has been considerably less sys- 
tematic experimental work done on 
the equally fundamental problem of 
how our thoughts develop. 


MICROGENESIS OF THOUGHT 


If one wished to apply the term 
“microgenesis of thought” to all published
accounts of the cognitive steps
involved in solving a problem, the 
relevant literature would be vast in- 
deed. Humphrey (44), Johnson (49, 
50), Osgood (75), Vinacke (118, 119), 
Woodworth (137), Woodworth and
Schlosberg (138), and others have 
given ample reviews of the multitude 
of studies which describe the tempo- 
ral sequence of concept acquisition or 
problem solution. Likewise, there are 
a number of published accounts of 
the microdevelopment of creative 
thinking, most of which have been re- 
viewed by Vinacke (119) and Wood- 
worth (137). Wallas (122), for exam- 
ple, divided the development of a 
creative thought into four stages: 
preparation, incubation, illumination, 
and verification. Patrick (76, 77, 78) 
and Eindhoven and Vinacke (25) 
conducted laboratory studies which 
attempted to test Wallas’ assertions. 
Although to define the limits of a 
single thought formation is admit- 
tedly a hazardous procedure, it may 
be fairly safe to assume that many, 
many thought formations occur in 
any solution sequence as extended as 
those typically involved in studies of 
creative thinking, formal problem- 
solving (24), and the like. It may be 
that laws of cognitive development in 
a solution process which extends over 
hours, days, or even years are of a 
piece with those pertaining to a 
“single” thought which requires sec- 
onds or fractions of a second to run 
its course, although at present we see 
no good evidence for such an identity. 
In any case, the present discussion 
will be confined primarily to those 
few studies which concern the nature 
of thought in relatively brief cogni- 
tive sequences. 

As is well known, the classical con- 
troversy between the Cornell and 
Würzburg schools involved, among
other things, a dispute as to whether 
images were or were not the “carriers”
of mental life (4, 44). In their
attempts to settle the question by 
means of introspection studies the 
members of these schools were of 
necessity concerned with what lay
behind completed cognitive acts. 
Although they were not explicitly 
concerned with constructing micro- 
genetic theories, some of their find- 
ings bear upon the development of 
thought as we are defining it. For 
example, despite differences in opin- 
ion as to whether or not thoughts are 
fundamentally imaginal in substance, 
both factions reported evidence that 
images may play a variety of roles in 
the microgenetic sequence. Thus, 
according to Humphrey's account 
(44, pp. 283-288), various introspec- 
tive studies suggest that images some- 
times seem merely to illustrate or ac- 
company thoughts already in prog- 
ress, sometimes serve as starting- 
points for subsequent thought micro- 
genesis, and sometimes even consti- 
tute distractions by leading S to 
dwell upon the images instead of 
progressing in the thought sequence 
or by leading S to a thought wholly 
irrelevant to the cognitive task at 
hand. Further, Willwoll (44) found 
that the images which impede think- 
ing tend to be more clear and con- 
crete than those which do not. Al- 
though its function may be highly 
variable from instance to instance, it 
is perhaps safe to conclude that imag- 
ery, when it occurs, tends to be a 
phenomenon characteristic of the 
earlier stages of thought microde- 
velopment. 

In addition to their studies of the 
role of imagery in the microdevelopmental
process, these early psychologists,
especially the Würzburgers,
made some interesting observations 
about the developmental sequence as 
a whole. Thus Messer (70) distin- 
guished between vague, undeveloped 
thoughts without words or images
and fully formulated propositions
with clear consciousness of meaning.
For example, one of his Ss gave the 
following introspection after having
responded “corner” to the stimulus
word “angle”: “The tendency was
towards the well-known proposition
that the sum of the angles of a triangle
equals two right angles . . . but
it did not mature” (p. 178). Bühler
(8) studied somewhat more complex 
thought problems, instructing his Ss 
to “solve” a variety of proverbs, 
aphorisms, etc., and to report their 
introspections of the solution process. 
On many occasions the Ss would re- 
port that they had, early in the solu- 
tion sequence, vague, imageless half- 
thoughts or premonitions about such 
things as the task, the nature of the 
solution, the possibility or impossibility
of solution, the problem's relationship
to other problems, etc.¹
Bühler's data suggest that very early
thoughts seem to serve somewhat as 
global schemata which orient the 
thinker as to the nature of the solution.
That is, the thinker may have
experiences of vaguely knowing where 
the solution will lie, with what prob- 
lems or persons the solution is associ- 
ated, how difficult the solution will 
be, and so on, considerably prior to 
possessing the fully formulated solution,
that is, prior to thinking the problem
through. The Ss’ introspections suggest
that it is as if the final solution
somehow differentiates out of the
diffuse, generic-like, “framework”
thoughts which precede it. Whether 
or not these microdevelopmentally 
immature thoughts always have an 
image-like composition is a question 
which seems less important to us to- 
day than does the question of the role 
these early thoughts play in the de- 
velopmental process. 

¹ For a wealth of anecdotal evidence for
such microdevelopmentally early thoughts,
see Wallas (122, Chap. 4).

There have been a few psychologists,
from the beginning of the century
down to the present day, who
have more or less explicitly theorized 
about the microgenesis of thought. 
One of the earliest of these was Jung 
(51), who considered the problem in 
the context of his studies of word as- 
sociation. He expressed the belief 
that ‘superficial’ word association 
responses, such as clangs and word 
and phrase completions, are the 
initial, immediate cognitive responses 
to words and that they are normally 
suppressed in favor of the more 
meaningful responses which follow 
them in the apperceptive process. 
Jung posited a temporal hierarchy 
of modes of word cognition which 
progresses, in the course of the apper- 
ceptive process, from the most super- 
ficial cognition of the physical char- 
acteristics of the word, through a 
cognition of the word as a member of 
a familiar phrase, and finally through 
cognition of the word’s denotative 
and connotative meanings. 
Somewhat later, Pick and Thiele 
(81), Van Woerkom (113, 114), and 
Bouman and Grünbaum (5) formulated
hypotheses about thought development
in the course of their work
with aphasics. Pick and Thiele, 
drawing upon the earlier work of 
Bühler and Messer, suggested that
the word cognition process typically 
goes through a series of stages which 
are, in part, somewhat reminiscent of 
Jung's formulation: (a) recognition of 
the word as a physical object, an 
“acoustic Gestalt,” (b) an awareness
of the general “meaning sphere” of
the word, i.e., location of the word in 
conceptual space, (c) comprehension 
of the grammatical form of the word. 
Pick and Thiele state that the succes- 
sion of these stages is not invariable 
and that more stages may be involved 
if S is required to make a verbal 
formulation of his cognition. Bouman
and Grünbaum conceive of the
cognition of a stimulus as beginning
with a total, amorphous general impression
(e.g., “good” or “bad,”
“right” or “wrong”) which, in normal
individuals, is followed by successive 
differentiations of the total stimulus 
into its component meaningful parts. 
Similarly, Van Woerkom insists that 
the developmental process typically 
begins with the conception of the 
whole idea, with a stage of analysis 
and synthesis following. 

A more recent theoretical exposi- 
tion of the course of thought develop- 
ment in forming word associations is 
given by Rapaport, Gill, and Schafer 
(84) and Schafer (97). They suggest 
that the normal process of giving a 
word association to a stimulus word 
consists of two principal microde- 
velopmental phases: an analytic, de- 
compository stage in which the stim- 
ulus word is broken down into its 
component ideas and one of these 
ideas is selected as the basis for the 
association to come; following this, a 
synthetic, compository phase in 
which the response word is con- 
structed from a thought associated 
with this particular component idea. 
In both phases the associative process 
is assumed to be guided by an over-all 
set to produce a response word con- 
ceptually related to the stimulus 
word, a set which becomes even more 
specific when S hears the stimulus 
word. When for any reason the 
thought process does not pass 
through both phases the resulting 
associative response will be atypical. 
Rapaport et al. designate as close 
those responses which indicate that 
the process has not proceeded past 
the first, analytic phase and as distant 
those which suggest that the syn- 
thetic process has overdeveloped in 
an associative sequence tangential or 
irrelevant to the task-induced anticipation.
Thus, close associations include
repetitions of the stimulus
word, attributes, clangs, and phrase 
completions—associations which indi- 
cate, as Jung had suggested earlier, 
that the associative process has been 
“aborted” early in its microdevelop- 
ment. Those responses which are 
logically unrelated or very marginally 
related to their stimulus words are 
scored as distant associations, the 
presupposition being that intermediary
associations in the synthetic
phase have constituted the connecting
links between the stimulus word
and the seemingly irrelevant re- 
sponse word. Despite differences in 
basic theoretical orientation, Jung 
and the Rapaport group appear to 
agree, at least implicitly, on several 
points of importance to microgenetic 
theory. First, they both consider the 
task of giving a word association to a 
verbal stimulus as a simple thought 
problem, the study of which may shed 
light on cognitive processes in general. 
Further, they believe that producing
word associations is a microdevelopmental
process in which successive,
and perhaps conceptually distinct 
stages occur within a brief time span. 

By far the most explicit theoretical 
elaboration of a microgenetic view of 
thought formation has been given by 
Schilder (98, 99). According to this 
theorist, thought begins with a diffuse 
conception of its goal, some sort of 
vague direction in which it is to go. 
The early stages of its development 
from this point onward he termed the 
preparatory phase of thought. In this 
phase a host of mental contents (pres- 
entations as Schilder called them) 
feed into the ongoing thought de- 
velopment. These vague presenta- 
tions may be logically relevant or 
irrelevant in relation to the thought 
nucleus which is at this time gaining 
ever-increasing structure and clarity. 
Those ideas or images which are relevant
are incorporated into the process
and enrich the forming thought;
those which are irrelevant normally 
get suppressed and at most remain 
only as “background music” for the
evolving thought. In this early, pre- 
paratory period mental contents are 
said to be of a symbol- and imagelike 
character, very susceptible to fusions 
and condensations with each other 
and with the developing thought 
structure, and subject to emotional 
restructuring in accordance with 
what we would today term “primary 
process” influence. The logic by 
which certain of these primitive pres- 
entations rather than others come to 
the fore is not specifically described 
beyond stating that contiguity and 
similarity, especially similarity of ex- 
ternal, superficial attributes, play 
major roles. Schilder further states 
that, as the development progresses, 
the thought structure normally be- 
comes more and more reality-oriented 
and less and less wish-determined
(Undeutsch (112) had said the same
thing about perceptual development),
as well as less ridden with concrete
imagery, less symbolistic, less undifferentiated
and unstable, etc.
Schilder’s theory of microgenesis can 
of course be roundly criticized on a 
number of grounds. The referents of 
many of his terms are highly obscure, 
his exposition proceeds unencum- 
bered by restraint or caution, his 
thesis lacks direct evidence, and so 
forth. Nevertheless, it can be said 
that he has fashioned a series of 
strikingly imaginative and original 
hypotheses about an aspect of cog- 
nition which has sadly needed explicit 
theorizing, however high-flown and 
speculative. 

JOHN H. FLAVELL AND JURIS DRAGUNS

At this point our rather meager
history is completed and stock-taking
is in order. Although the existing
evidence hardly permits any kind of
integrated theory of thought microdevelopment,
it is at least possible to
see some commonality and con- 
sistency in what has been said and to 
organize a series of very tentative 
statements about the topic—a sort of 
loose conceptual framework within 
which to think about the develop- 
ment of thoughts. In this hypotheti- 
cal account, we will lean most heavily 
upon Schilder’s writings but will also 
draw from the work of Rapaport et al., 
Jung, and the rest. 

First of all, thought in its early 
stages is global, diffuse, and undiffer- 
entiated in structure (131); that is, 
mental contents, be they images or 
imageless thoughts, tend to coexist 
without articulation and without
clearly defined interrelationships.
These early thought elements may be 
vague, imageless thought tendencies 
concerning the task, the solution, the 
thinker’s relationship to task and 
solution, etc. Images, when they oc- 
cur in thinking, also tend to be early 
rather than late products and may 
serve as primitive and concrete
anchoring-points or, as Schilder puts
it (99), “symbols” for what is
to come. Microgenetically early 
thoughts, imaginal or imageless, seem 
to have the quality of what Rapaport 
(83) has termed drive-representations, 
i.e., needs and affects are particu- 
larly sovereign in determining which 
thoughts push for expression, which 
thoughts feed into the developmental 
process. Moreover, the laws of com- 
bination and association of thoughts 
in the beginning phases likewise seem 
to resemble those posited for primary 
process thinking, i.e., association by 
contiguity, association by superficial, 
external similarity, association on the 
basis of common personal predicates 
and a prevalence of condensation 
and displacements (32). Thus the 
thought process tends first towards 
this, then that premature, “paleological”
solution (1) and early judgments
of solutions tend to be primitive,
dichotomous affairs framed in
terms of me-not me, good-bad, etc. 
(99). In the later stages of develop- 
ment thought ordinarily becomes dif- 
ferentiated into various components 
and these components become logi- 
cally interrelated in the formation of 
the solution. Thought in the final 
phase is normally reality- rather than 
drive-oriented and the early non- 
logical thought developments have 
become aborted, as it were, and no 
longer influence the form of solution. 
It is very likely that, in most people 
under ordinary circumstances, this 
extraordinarily rapid developmental 
process does not become an object of 
awareness and the thinker is conscious
only of the completed thought.²


² In this connection Rapaport, Gill, and
Schafer (84) state:

“These preparatory phases are, in the average
subject, preconscious: however, in introspective
and/or obsessive people, the inquiring
examiner often obtains reports on what happened
in the brief interval between the stimulus-
and reaction-word: how definitions,
images, clang and other deviant associations
occurred and were rejected, though the result
came quickly and as a ‘popular reaction’”
(p. 20).


IMPLICATIONS OF MICROGENETIC
THEORY

It is proposed here that the microgenetic
approach can be fruitfully applied
to the cognition (perception and
thought) of pathological individuals
under normal conditions and of normal
individuals under atypical, nonnormal
conditions. An attempt will
be made to provide evidence that
such atypical cognitions tend to
manifest formal characteristics similar
to those already predicated for
microgenetically incomplete cognition.
Such evidence would suggest
the general hypothesis that most or
all atypical cognitions, whether found 
in normal or pathological individuals, 
are special cases of normal, mature 
cognition in the sense that they are 
cognitive forms which have aborted 
prior to complete development. Thus, 
within this frame of reference, nor- 
mal cognition is not defined simply 
by the absence of nonnormal attri- 
butes nor is atypical cognition viewed 
as a unique, qualitatively distinct 
formation. Normal, logical cognition 
is seen as a microdevelopmental
achievement of the organism and 
deviations therefrom as developmen- 
tal arrests. Such an approach, should 
the facts justify it, permits one to 
subsume a host of cognitive phenomena
under one developmental
theory and, at the same time, makes
the study of the normal, prototypical 
microgenetic process something of 
considerable theoretical urgency. 

In surveying the evidence for these
beliefs, our previous major breakdown
in terms of percepts versus
thoughts will be abandoned; instead, 
we shall examine the findings topic by 
topic, drawing from whichever set of 
microgenetic hypotheses (perceptual 
or thought) best applies to the data 
at hand. 


Normals Under Atypical Conditions 


Distraction constitutes one set of 
conditions under which normal indi- 
viduals tend to produce cognitive re- 
sponses which could be called atypi- 
cal. There have been a few studies 
which have attempted to study dis- 
traction effects. Jung (51) and 
Speich (108), for example, both 
found that when Ss are asked to give 
word associations under distraction 
conditions the tendency is for super- 
ficial, external responses (clangs, 
word-completions, etc.) to increase. 
In another publication (52) Jung reports
an interesting early study by
Stransky on the effects of an experi- 
mental condition similar to distrac- 
tion. His Ss were instructed to talk 
about anything for one minute with- 
out attending to what they were 
saying. He found that these instruc- 
tions produced an abundance of im- 
mature-like processes which included 
substitution of superficial connections 
(clangs, etc.) for logical ones, numer- 
ous perseverations, and fusions of 
competing verbal responses which re- 
sulted in neologisms and contamina- 
tions. Not all the evidence with re- 
gard to distraction effects is in accord 
with microgenetic theory, however;
Cameron and Magaret (11) failed
to find such effects when distraction
was superimposed on a task of completing
incomplete sentences.


There is some evidence pertaining
to the formal characteristics of thinking
in dreams, daydreams, and semisleep.
Freud (32), as is well known,
characterized dream-thinking as be- 
ing replete with condensations, dis- 
placements, symbolization of ab- 
stract thoughts via concrete images, 
prelogical thinking mediated by external
and superficial or highly subjective
similarity, etc. Varendonck
(115), in his classical study of day- 
dreams, has likewise stressed the lack 
of criticality and logical direction 
and the important role of nonverbal 
imagery which obtains in ordinary,
conscious fantasy. Mintz (72),
Rapaport (82), and Silberer (102) 
have described the hypnagogic or 
semisleep state in somewhat similar 
terms: decrease in reflective aware- 
ness, or sharply focussed self-critical- 
ity; symbolization (via images rather 
than words) of bodily states, atti- 
tudes, etc., as well as ordinary thought 
contents; and a tendency to substi- 
tute prelogical autistic thinking for 
logical, conventional thought. Jung
(51) reports a study in which one S
was given a word-association test 
both under normal waking conditions 
and under conditions of semisleep. 
The S, while drowsy, gave about 
seven times as many clang reactions 
as when in the waking state. 

There are a variety of studies de- 
scribing the effects of various drugs 
upon thought and perception. Smith 
(106) found that alcohol tends to increase
the frequency of word associations
of Jung’s “outer” type. He did
not report his results statistically but 
the senior author's recalculations of 
Smith's data suggest that this tend- 
ency was significant at about the 
p<.15 level of confidence. Both 
Woodworth and Schlosberg (138) and
Kohs (57) allude to old studies by
Kraepelin and his students which
suggest that caffeine tends to cause
Ss to give more superficial word associations.
There are a number of
studies describing the effects of
mescaline, lysergic acid derivatives
(LSD), and other “psychotogenic”
drugs on cognition (28, 37, 41, 42,
43, 45, 63, 64, 69, 92, 110). Some, although
by no means all, of these drug
effects seem consistent with what we
would consider to be the formal characteristics
of microdevelopmentally
early cognition. Thus Ss under
the influence of LSD or mescaline
have been found to show, among
other things: looseness of association;
rhyming and punning; inability to 
follow a single train of thought with- 
out interpenetration and fusions with 
other thought sequences; predomi- 
nance of vivid imagery in thinking; 
and a general lability of percepts. In 
connection with the imagelike char- 
acter of thoughts, for example, some 
of Meadow’s Ss reported that they 
had to overcome the ever-present 
visual images in order to think ab- 
stractly (69). One of Guttman’s Ss 
described this phenomenon as follows:
“Each word I thought was
connected with a picture. This hindered
my thinking, as the concrete
pictures held me” (37, p. 213).
Lindemann and Clarke (63), and 
Kubie and Margolin (60) have also 
suggested that other drugs, such as 
scopolamine, sodium amytal, nitrous 
oxide, and various barbiturates, pro- 
duce cognitive states essentially equi- 
valent to those previously described 
for the semisleep state. 

In addition to distraction, drugs, 
deviations from the waking state, 
etc., there are several other miscel- 
laneous conditions which deserve 
brief mention. According to Kohs 
(57), Aschaffenburg found the famil- 
iar increase in clang and completion 
responses when Ss were in a fatigued
state. Bexton, Heron, and Scott (3)
found that prolonged insulation of
Ss from external stimuli caused an
increase in directionless thought of
the daydream type and a falling back
upon extremely vivid imagery. Kline
and Schneck (56) found more “associative
alterations” when Ss gave
word associations while under hypnosis.
Although it is not altogether
clear from their paper, “associative
alteration” appears to include Rapaport,
Gill, and Schafer’s (84) distant
and mildly distant categories primarily.
Finally, Gellhorn and
Kraines (34, 35), in another word-association
study, report that experimentally
induced anoxia causes
an increase in perseverations and unusual,
irrelevant associations, a result
consistent with McFarland’s
earlier findings on the psychological
effects of oxygen deprivation (67).
We have so far considered verbal
cognitive behavior in normals under
atypical organismic states. Also of
interest from a microgenetic standpoint
are subverbal or preverbal cognitive
responses under more or less
typical organismic states—namely, 
the kinds of cognitive responses found 
in studies of semantic conditioning 
(21, 85, 86, 87, 88, 89, 90, 91, 139) 
and subception (62, 66, 68, 121). In 
both semantic conditioning and subception
studies Ss evidence some
sort of “cognition” of stimuli (usually
below the level of verbal report) by
means of measurable electrodermal
or salivary responses. It is interesting
to speculate as to whether these
kinds of dim cognitions which “register”
only at the physiological level
can be considered microgenetically
early, primitive forms which do not, 
for one reason or another, attain 
conscious awareness. Some of the 
studies of semantic conditioning re- 
veal curious facts which might sug- 
gest this. Razran (90), for example, 
reports one experiment in which a 
salivary response was conditioned to 
a given word and then S was presented
with a variety of other words,
each bearing a different relationship 
to the original stimulus word. As 
might be expected, synonyms, supra- 
ordinates and contrasts of the origi- 
nal word elicited salivary responses 
of fairly large magnitude. What was 
surprising, however, was that homo- 
phones (i.e., clangs) of the original 
word elicited salivary responses about 
as large as the more logically respect- 
able coordinates, part-wholes, whole- 
parts, and predicates, and of greater 
magnitude than subordinates, a highly
logical category! Common sense
will assert, and Flavell (30) has
demonstrated experimentally, that
normals do not consciously consider
words related by sound similarity to 
be as similar in meaning as those re- 
lated in terms of any of the semantic 
categories mentioned above. Yet 
the various studies of semantic con- 
ditioning seem to indicate that we do 
make generalizations about verbal
symbols, at the physiological level, 
on the basis of physical as well as 
semantic similarity. It may indeed 
be, as Jung long ago suggested, that 
“every apperceptive process of an 
acoustic stimulus begins at the stage 
of clang-like apprehension,” and that 
such an apprehension somehow gets 
“recorded” in an immediate auto- 
nomic reaction but normally does 
not persist, in subsequent micro- 
development, as a conscious com- 
ponent of the final cognition. It may 
also be possible to view at least some 
aspects of the problem of subception 
in similar terms. Bruner and Post- 
man (7) some time ago offered an 
interpretation of perceptual defense 
and subception data in terms of 
levels of response. They suggest that 
generic and diffuse affective responses
may occur prior to, or at
lower thresholds than, conscious cognitive
responses pertaining to the
specific nature of the stimulus. They
also mention, in passing, the relationship
of this view to the classical
“stages of perception” theories we
have already reviewed. More re- 
cently, Lazarus (61) has offered a 
somewhat similar view as one pos- 
sible explanation of subception. He 
suggests that the autonomic nervous 
system may be capable of making 
global, all-or-none discriminations 
between “danger” and “no danger,”
“shock” and “no shock,” etc., under 
stimulus conditions which are not 
adequate for precise differentiation 
of the more complex attributes of the 
stimulus. 
The really intriguing question 
which all this poses, of course, is 
whether or not so-called “uncon- 
scious” thinking and perceiving can 
be meaningfully framed within mi- 
crogenetic theory. One wonders 
whether the similarities which may
exist between unconscious cognition,
as in dreams for example, and what 
we have termed microgenetically 
early cognition are merely coinci- 
dental. May it be that unconscious, 
primary process cognitions are those 
which begin to develop, make their 
mark on behavior, and then, for 
reasons which can only be guessed 
at, abort below the level of conscious 
awareness? Conrad (15), to whose 
work on microgenesis we shall shortly 
refer, proposed a very similar ex- 
planation. The recent experiments 
by Smith and Henriksson (104) and 
Klein et al. (55) also provide some 
support for such a conceptualization. 
In these studies it was demonstrated 
that stimuli flashed at tachistoscopic 
exposure times too brief for conscious 
recognition definitely modified the 
perception of other suprathreshold 
stimuli presented immediately after 
them. Klein (54, p. 23) makes one 
statement, in discussing his results, 
which well expresses the tenor of 
our own musings:

A working hypothesis in this situation is 
that the A figure, exposed for a few micro- 
seconds, starts a cognitive process which is 
interrupted or covered over so quickly by the 
B process that it is, in effect, aborted. Some 
kind of compromise formation results in the 
reported percept. Such incomplete formations
may provide the condition for the operation of
primary process mechanisms (italics ours). 


Pathological Individuals 


We shall confine our discussion in 
this section mainly to two diagnostic 
groups in which atypical cognition 
seems especially predominant: schizo-
phrenia and aphasia. Fairly ade-
quate accounts of theory and re- 
search on schizophrenic cognition 
may be found in Arieti (1), Bellak 
(2), Cameron (10), Fenichel (27), 
Flavell (30), Goodstein (36), Kasanin 
(53), Wegrocki (123), and White 
(135). A study of the literature on
schizophrenic thought and perception
reveals two facts of particular rele- 
vance to a microgenetic approach. 
The first of these pertains to what 
seems to be a rather striking simi- 
larity between microgenetically im- 
mature cognition and schizophrenic 
cognition. The senior author, for 
example, has elsewhere (30) summa- 
rized some of the alleged salient 
features of schizophrenic thinking 
roughly as follows: condensations, el- 
lipses, word salad, neologisms, clang 
associations, tangentiality, incoher-
ence, word magic, “paleological”
thinking based upon logically super- 
ficial predicates of either external or 
inner-personal origin, excessive use 
of concrete symbolism, and others. 
Secondly, the so-called “regression” 
theorists, i.e., Arieti (1), Von Do- 
marus (120), Storch (111), Vigotsky 
(117), Werner (131), White (135) 
and various psychoanalysts (27), 
have related schizophrenic cognition 
to that found in normals under ab- 
normal conditions, in children, and 
in people belonging to “less ad- 
vanced” cultures. That is, they re- 
gard schizophrenic cognition as one 
instance of a more generic, primitive 
mode of cognition which is found in 
a variety of individuals under vari- 
ous conditions (30). Only Schilder 
(98, 99, 100), seemingly, has both
stressed the formal similarities among 
primitive or regressive cognitive proc- 
esses of various kinds and also ex- 
plicitly taken the further step of 
viewing such processes as themselves 
possible instances of microgenetical- 
ly immature cognition. It is of inter- 
est to note that Schilder focused par- 
ticular attention on schizophrenia as 
the example par excellence of a con- 
dition in which early cognitive for- 
mations intrude into consciousness 
and get expressed as though they 
were completed thoughts. Two of
the most interesting recent studies
pertinent to the problem of micro- 
genesis in schizophrenia are reported 
in the article by Phillips and Framo 
mentioned above (80). Rorschach 
cards were successively shown to 
schizophrenics and normals under in- 
creasing tachistoscopic exposure times 
and responses scored on a scale
of amorphousness-specificity-differ- 
entiation and organization, devised 
by Friedman (33). As exposure 
time increased, the normals’ percepts 
tended to progress from an initial 
amorphousness and vagueness to 
specificity and integration; the schiz- 
ophrenics’ percepts, on the other 
hand, tended to remain at the initial 
undifferentiated level. Also worthy 
of mention are the word-association 
studies by Rapaport et al. cited 
earlier. They found that schizo-
phrenics exceeded normals in re-
sponses presumably indicative of an
incomplete associative development, 
i.e., the various reactions classified 
as close or distant. 

We have stated that Schilder is es- 
sentially the only theorist who has 
systematically described schizophren- 
ic cognition in microdevelopmental 
terms. In this respect aphasia has 
fared somewhat better. As men-
tioned earlier, Bouman and Grün-
baum, Pick and Thiele, and Van
Woerkom derived their conceptions
of normal microgenesis directly from
studies of thought and perception in
aphasia. Thus Bouman and Grün-
baum, for example, found that an
aphasic patient had difficulty in
coping with stimuli which required 
analyzing parts within a whole 
(e.g., a design embedded within two 
overlapping figures) but could ade- 
quately handle perceptual and con- 
ceptual situations in which only a 
diffuse, over-all apprehension or a 
total dichotomous judgment was re-
quired. From evidence of this kind
Bouman and Grünbaum, Van Woer-
kom, etc., drew two conclusions:
first, the normal sequence of cogni-
tion has a certain characterizable de-
velopmental form; second, this de-
velopmental sequence somehow gets
arrested during its early stages in
aphasia. Certainly the most vocal
and explicit proponent of a micro- 
genetic interpretation of aphasic cog- 
nition has been Conrad (13, 14, 15, 
16, 17, 18). Conrad has systematical- 
ly applied the theoretical formula- 
tions of the Aktualgenese school to 
aphasic cognitions. He states that 
the normal process of cognition in- 
volves both progressive differentia- 
tion and integration of stimulus ma- 
terial and that in aphasia, one or both 
of these processes typically tends to 
be incomplete (14). Conrad describes 
four levels of disability which may 
occur: (a) normal Gestalt formation 
gets accomplished but with abnormal 
effort and tension; (b) the figure gets
differentiated from background but 
does not itself become differentiated 
or structured; (c) figure and ground 
are not clearly articulated and the 
percept is vague and amorphous, as 
though presented tachistoscopically 
at very brief exposure times; (d) lack 
of any Gestalt formation of any kind 
(13). Conrad suggests that certain 
memory processes may also be con- 
sidered from a microgenetic stand- 
point (15). For example, he elabo- 
rates upon Wenzl’s (126) earlier ac- 
count of the process of word-finding, 
suggesting that, in attempting to re- 
member a forgotten word, we pass 
through successive stages structural- 
ly similar to those found in percep- 
tual microdevelopment. The se- 
quence of mnemonic reconstitution 
is the same for normals and aphasics, 
although, of course, the problem of
searching one’s memory for forgotten
words may be an almost ever-present 
one for the aphasic. Worthy of cita- 
tion here also is the extensive work 
of Ombredane, whose findings like- 
wise support a microgenetic concep- 
tion of aphasic disorders (74). Wer- 
ner’s paper (132), a revised and ex- 
tended version of an earlier German 
publication (129), is the most recent 
exposition of an avowedly micro- 
developmental approach to aphasia, 
and perhaps the only one in the 
literature published in English. In 
this study, one of a series of pioneer 
investigations in the area of micro- 
development (127, 128, 130), words 
were presented repeatedly under 
gradually increasing tachistoscopic 
exposure times and normal Ss were 
asked to recount their perceptual ex- 
periences at each exposure until full 
recognition was achieved. Werner 
found that a number of his Ss re- 
ported experiencing spheres of mean- 
ing prior to specific and complete 
recognition of the word stimuli. For 
example, some Ss would experience 
“feelings” about the as yet undis-
criminated stimulus word: feelings
that it is “warm,” “vibrating,”
“soft,” etc. Also, Ss would occa-
sionally get a global impression of
the domain or class within which the
word belongs (“it is something shin-
ing,” etc.). Werner then describes
highly similar spheric experiences 
reported by aphasics in the course of 
attempting to name familiar objects, 
read or grasp the meaning of familiar 
words, and so on. He suggests that 
in such cases the patient's overt re- 
sponse is the result of a premature 
precipitation of spheric experiences 
into verbal expression, i.e., a micro- 
genetic abortion of the kind Conrad 
and others have described. In the 
remainder of his paper, Werner dis-
cusses some interesting implications
of this view for the re-education of 
patients with aphasic disorders. 


FUTURE PROBLEMS 


We have discussed some of the 
evidence pertaining to a microgenetic 
approach to cognition and some of 
the possible applications of this ap- 
proach to various cognitive states in 
normal and pathological individuals. 
Other possible extensions could be 
delineated. For example, formal re- 
lationships between microgenetically 
immature cognition and cognitive 
functioning in nonaphasic brain-
damaged cases, aments, depressives,
manics, and normal children have 
not been discussed, although there 
is some evidence which might sup- 
port some such comparisons (26, 39, 
46, 79, 84, 133). Also, possible rela- 
tionships between personality vari- 
ables and microgenetic sequences 
need exploration. It is interesting to 
note that members of the Aktual- 
genese school were actively con- 
cerned with correlating individual 
differences in microdevelopmental 
sequence with “personality types”
(23, 38, 40) and that Sander himself 
thought of microgenesis as a potential 
avenue for the exploration of the un- 
conscious (94, 95). A recent study by 
Smith and Klein (105), although 
concerned with somewhat more ex- 
tended cognitive sequences than 
those we have been considering, is 
also relevant to the problem of mi- 
crogenesis-personality relationships. 
However, the most extensive and per- 
haps the most intriguing investiga- 
tions in this area are those recently 
described by Kragh (58). This Swed- 
ish psychologist has formulated a 
bold and explicit personality-percep- 
tual microgenesis theory and has re- 
ported a series of tachistoscopic ex-
periments which purport to show
relationships between the ontogenesis 
of personality and the microgenesis of 
percepts. As a final extension, one 
can speculate with Werner (132) as to 
whether such functions as memory 
and motor performance—as well as 
perceptual and conceptual develop- 
ments more complex and of longer 
duration than the ones considered in 
this paper—typically undergo devel- 
opmental sequences similar in formal 
aspects to those already described. 
Such questions, however, seem 
somewhat premature at present in 
that they assume a more complete 
factual knowledge of the prototypical 
microgenetic processes than we now 
possess. A problem of much higher 
priority concerns whether, and by 
what means, the nature of these elu- 
sive processes themselves can be ex- 
perimentally elucidated. With regard 
to perceptual microdevelopment, it is 
clear that more adequately designed 
studies of the formal aspects of 
genetic sequences are needed. For 
example, it would be possible to 
avoid the hazards of relying solely 
upon verbal report in tachistoscop- 
ic studies by requiring artistically 
trained Ss to draw rather than de- 
scribe their percepts at each exposure 
time level. It should then be possible 
to study sequences by having the 
drawings categorized by judges as to 
such formal features as diffuseness,
degree of figure-ground articulation, 
etc. Such a study would perhaps lay 
claim to greater objectivity than 
those hitherto reported. Likewise, 
for the development of thoughts or 
concepts, a plausible experimental 
technique might be that of motivat- 
ing Ss to produce word associations 
under extreme time pressure and 
comparing the formal aspects of the 
resultant associations with those pro-
duced by Ss who had not responded
under pressure. Techniques of this 
type have been used with some suc- 
cess in the past (19, 51, 71, 101, 108). 
Both of the above methods, or modi- 
fications thereof, could of course also 
be used with nonnormal populations 
in order to study regressive cognition 
within a microdevelopmental frame- 
work. 

In concluding, it is perhaps ap- 
propriate to underscore the consider- 
able problems which confront the 
microgenetic approach in its current 
form. In the first place, the ab- 
stractness, looseness of logical struc- 
ture, and general semantic impreci- 
sion which characterize present-day
microgenetic theory may be in part 
responsible for the ease with which 
it seems to subsume so many diverse 
cognitive phenomena. Such a criti-
cism implies that as the conciseness
and testability of the theory in-
crease, nature will seem less coopera-
tive and problems of generalization
will arise. Likewise, at the data
level, it must be apparent that the
findings on the basis of which micro-
genetic hypotheses have been con-
structed are by no means gilt-edged.
For example, many of the studies
cited stem from an era when care-
ful experimental control could hardly
be called the rule. Perhaps a more
serious criticism pertains to the
nature of the typical experimental
operations by which microgenesis is
allegedly demonstrated. It could be
argued, for instance, that the fact
that an S might, under time pressure,
produce responses classified within
the theory as microgenetically un-
developed does not prove conclu-
sively that such responses really “oc-
cur” but are suppressed in the nor-
mal, unhurried associative process.
It is certainly possible to pose alter-
native explanations in terms of
variations in set or alterations in
verbal habit-family hierarchies in- 
duced by time pressure. Similarly, 
there is no absolute proof that the 
sequence of percepts found when the 
tachistoscopic method is used is a 
faithful reflection of the natural proc- 
ess of percept development. Perti- 
nent criticisms of this order have 
been raised by Weinschenk (124, 
125), and Klein.* It is true that one 
can counter such objections with 
logical arguments and by citing in- 
trospective evidence, such as the 
verbal reports of Rapaport’s obses- 
sional group mentioned earlier (Foot- 
note 2). Nonetheless, such objections 
have real force and the experimentum 
crucis which would settle the matter 
is difficult to conceive at present. For 
us the microgenetic interpretation 
has led to a fresh, albeit highly specu- 
lative, view of a variety of cognitive 
phenomena and has suggested certain 
lines along which research might pro- 
ceed. We are thus inclined to tolerate 
its ambiguities for a time out of sheer 
curiosity to see what will come of it in 
the future. 


SUMMARY 


The present paper has proposed a
microgenetic approach to perception
and thought. Within this approach,
thoughts and percepts are believed
to undergo a very brief, but theo-
retically important, microdevelop-
ment. Evidence was offered both to
support the possibility that such
microdevelopments do occur in the
normal process of thinking and per-
ceiving and to suggest some of the
formal characteristics of such evo-
lutions. Further, an attempt was
made to delineate some of the pos-
sible implications of this approach for
cognitive functioning in abnormal
individuals and in normal individuals
under atypical conditions. Finally,
consideration was given to current
problems and future research possi-
bilities in relation to a microgenetic
framework.

* Klein, G. S. Personal communication,
May 28, 1956.


The writer became interested in
Sheldon’s physical and temperamen- 
tal types (10, 11) because they have 
been so widely, and frequently so 
favorably, discussed in recent years. 
Relatively little investigation was 
needed in order to discover that the 
favorable discussions had little foun- 
dation in fact for the attitude ex- 
pressed and that the use of Sheldon's 
types in further research should be 
discouraged. 

In the course of this investigation 
interest was aroused concerning type 
theory generally. The conclusions 
reached have implications beyond 
the Sheldon types. It is believed, in 
brief, that traditional type² theories
have important characteristics in 
common that arise inevitably from 
the definition of type. It is further
believed that these characteristics
make type concepts unsuited for most
research purposes.
Organization of Discussion. A brief 


Paper completed while serving as visiting
professor, University of Illinois, fall semester,
1955, and while on leave from Personnel Lab-
oratory, Air Force Personnel and Training
Research Center, Lackland Air Force Base,
San Antonio, Texas. The opinions or conclu-
sions expressed herein are those of the author.
These are not to be construed as necessarily re-
flecting the endorsement of the Department of
the Air Force or of the Air Research and De-
velopment Command.

2 The reader should beware reading into this
discussion his own connotations with types.
As the argument develops, it will be seen that
a rather special definition of type emerges.
This definition best characterizes Sheldon's
types, but it is believed that it is generally
applicable to his predecessors, such as
Kretschmer, as well.

review of Sheldon’s work will first be
presented, including the logical-sta-
tistical analysis of his data that led
the writer to reject his concepts. This
will be followed by a discussion of the
characteristics of type concepts and
the similarities of types to ipsative
(or relative) scales. Types, and ipsa-
tive scales, will then be contrasted
with traits, or normative scales, and
the applications of each in research
pointed out. Finally the application
of the multiple discriminant function
to problems that in the past led to
efforts at typing will be briefly de-
scribed.


REVIEW AND ANALYSIS OF 
SHELDON’S CONCEPTS 

Sheldon has described three physi-
cal types: endomorphs, characterized
by visceral development; meso-
morphs, characterized by skeletal and
muscular development; and ecto-
morphs, characterized by neural de-
velopment. Each of these three types
can be reliably rated on a seven-step
scale for every individual. Precise
physical measurements can be used as
the basis for these ratings. He has
also described three temperament
types: viscerotonic, somatotonic, and
cerebrotonic. Each individual can
also be reliably rated on a seven-step
scale for each of the three tempera-
ment types. These ratings are ob-
tained from a 60-item rating scale
divided into three clusters of 20 traits
each. An individual's type scores are
conventionally written as three num-
bers, each having a theoretical range
from one through seven, separated by
hyphens; e.g., 7-1-1, 4-4-3, 2-2-6, etc.
Data are also presented apparently
showing that physique and tempera-
ment are opposite sides of the same
coin; i.e., the correlations between
the logically related types of phy-
sique and temperament are all in the
neighborhood of .80.

Usual assessment of Sheldon’s con-
tribution. The assessment of Shel-
don’s contribution to typology is fre-
quently divided into two parts. In
the first place, he is credited with the
introduction of quantification in typ-
ing procedures. The attitude taken
in this paper, however, is that there
is no virtue in quantification if there
is no justification for the variables
measured. Thus, thorough investiga-
tion of the origin and characteristics
of his variables is indicated. In the
second place, Sheldon is credited with
obtaining the most substantial rela-
tionships yet obtained between phy-
sique and temperament. In evaluat-
ing this second contribution, it is
more important to evaluate the con-
trols used in obtaining the measures
correlated than it is to compute the
standard errors of those correlations.

Analysis of Sheldon's types of phy-
sique. The error involved in accept-
ing Sheldon’s work at face value be-
comes apparent when his procedure
is reviewed. First, with regard to es-
tablishing the physical types, it is
clear that the procedure was not em-
pirically sound. The types originated
in the arm chair. Sheldon did have
large numbers of photographs spread
out before him when he selected the
types, but that hardly makes the pro-
cedure empirical. If Thurstone had
spread 56 printed tests before him-
self and decided what factors were
needed to describe performance on
these tests, he would have produced a
set of human abilities with about as
much justification as Sheldon has for
his physical types. In a situation of
this sort, it is unlikely that the ob- 
server would find much beyond what 
he expected to find. Any similarity 
between the Sheldon and Kretschmer 
types is certainly not coincidental, 
and does not necessarily mean that 
Kretschmer was groping in the right 
direction in a primitive sort of way. 
An analysis of the intercorrelations 
of Sheldon's types will also contribute 
to the evaluation of his concepts. 
Sheldon (10) has published the inter- 
correlations of his physical types, 
based on two samples of 2,000 and 
200 cases respectively. He also in- 
cluded, in an appendix, data for 4,000 
cases. In order to obtain greater sam-
pling stability, correlations were com-
puted for the sample of 4,000 cases by
the writer. These are presented in
Table 1. In comparing them with
those published by Sheldon (the sam-
ple of 2,000 cases was presumably in-
cluded in the 4,000), an additional
advantage of the new computations is
discovered: what appears to be a
computational error in the published
value for the correlation between en-
domorphy and ectomorphy in the
sample of 2,000 cases is corrected. A
value of 27 is improbably low in
comparison to the present value of
-.40. It might also be noted here
that other errors have been found in
Sheldon’s computations (7).

TABLE 1

INTERRELATIONSHIPS OF PHYSICAL TYPES

              Endomorphy  Mesomorphy  Ectomorphy

Endomorphy       .765       -.300       -.402
Mesomorphy                   .818       -.576
Ectomorphy                               .833

Note. Intercorrelations of types are presented in
the usual fashion. Multiple correlations between each
type and the other two are listed in the diagonal.
N = 4,000.

Table 1 also includes, as the diag-
onal entries, the multiple correlations
between each possible pair of types 
and the third. It is seen that the mul- 
tiples are in every case much higher 
than the zero-order correlations, but 
cannot be said to approximate unity. 
Sheldon makes a good deal of this 
“thickness,” i.e., evidence for three-
dimensionality. Before accepting 
this evidence for three-dimension- 
ality, however, other factors must be 
considered. Ekman (4) has also con- 
sidered these factors in evaluating 
the claim for three dimensions. The 
present development differs from his, 
particularly with respect to the use 
of the multiple correlation technique. 
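
The diagonal entries of Table 1 can be checked from the three intercorrelations alone. The following is a minimal sketch, assuming the standard formula for the multiple correlation of one variable on two others; the function name `multiple_r` is illustrative and does not come from the paper, and the off-diagonal values are those of Table 1, all taken as negative in line with the text.

```python
import math


def multiple_r(r_ab, r_ac, r_bc):
    """Multiple correlation of variable A on variables B and C.

    Uses the standard three-variable formula
        R^2 = (r_ab^2 + r_ac^2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc^2)
    where r_ab and r_ac are the correlations of A with each predictor
    and r_bc is the correlation between the two predictors.
    """
    r_squared = (r_ab**2 + r_ac**2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc**2)
    return math.sqrt(r_squared)


# Intercorrelations of the physical types as given in Table 1 (N = 4,000).
endo_meso, endo_ecto, meso_ecto = -0.300, -0.402, -0.576

# Each type in turn is treated as the variable predicted from the other two.
table1_multiples = {
    "endomorphy": multiple_r(endo_meso, endo_ecto, meso_ecto),
    "mesomorphy": multiple_r(endo_meso, meso_ecto, endo_ecto),
    "ectomorphy": multiple_r(endo_ecto, meso_ecto, endo_meso),
}

for type_name, value in table1_multiples.items():
    print(f"{type_name}: R = {value:.3f}")
```

The values obtained this way agree with the tabled multiples of .765, .818, and .833 to within the rounding of the three-place intercorrelations used as input.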

The scatter plots presented by 
Sheldon not only represent negative 
correlations, but show evidence of a 
good deal of curvilinearity as well. It 
is possible to correct for some of this 
distortion. The transformations 
shown in Table 2 were obtained by 
estimating the amount the scales 
needed to be “stretched” at the high
end in order to convert the curvilinear 
regressions to something approaching 
linearity. The rationale for a correc- 
tion of this type, as Sheldon has 
stated, may be the lack of linear rela- 
tionship between stimulus units and 
judgments of equal increments. The 
intercorrelations were then recom- 
puted, and new multiple correlations 
determined. These are presented in 
Table 3. As compared to Table 1, a 
gratifying increase in the multiples is 


LLOYD G. HUMPHREYS 


TABLE 2

COMPARISON OF SHELDON'S SCALES WITH THE
CONVERTED SCALES

Columns, for each of Endomorphy, Mesomorphy, and Ectomorphy:
Sheldon's Scale (steps 7 through 1) and Converted Scale.
[Converted scale values not legible in this copy.]

Note. The new scales, determined by inspection,
were designed to make the regressions of each type on
the others more nearly linear.


evident, but one is still uncertain 
whether any one variable is completely
determined by the other two. 

There are other legitimate correc- 
tions that can be applied as long as 
we are interested in the problem of 
intrinsic relationships among the 
types. Certain errors of measure- 
ment, which are involved in the de- 
termination of physical type, and er- 
rors of grouping, since the scales are 
not in actuality continuous, also at- 
tenuate the relationships obtained. 
By applying Sheppard's correction to
the standard deviations before com- 
putation of r, correction was made for 
the second of these attenuating fac- 
tors. These relationships are pre- 
sented in Table 3. Assuming relia- 
bility coefficients of both .95 and .97, 


TABLE 3

TYPE INTERRELATIONSHIPS AFTER CORRECTION
FOR CURVILINEARITY AND DISCONTINUITY

              Endomorphy   Mesomorphy   Ectomorphy

Endomorphy    .835/.901      -.347        -.444
Mesomorphy      -.334      .870/.924      -.613
Ectomorphy      -.424        -.586      .881/.931

Note. Correlations involving the converted scales are below the diagonal. Correlations above the diagonal
were computed after applying Sheppard's correction to the standard deviations. Multiples are in the diagonal
with the higher of the two representing the relationships obtained after making both corrections.
N = 4,000.


TABLE 4

TYPE INTERRELATIONSHIPS AFTER CORRECTION FOR UNRELIABILITY

              Endomorphy   Mesomorphy   Ectomorphy

Endomorphy    .953/.989      -.365        -.467
Mesomorphy      -.358      .964/.992      -.645
Ectomorphy      -.458        -.632      .968/.993

Note. The data, after correction for attenuation, are presented as in Table 1. Reliabilities were assumed to be,
first, at the .95 level and, second, at .97. Values based on the former assumption are above the diagonal; others,
below. The separate sets of multiple correlations obtained are again in the diagonals and are approximately unity.
N = 4,000.


the values of r in Table 3 were cor-
rected for errors of measurement, and
new multiples were computed. These
results are presented in Table 4. With
assumed reliabilities of .95, the evi-
dence for three-dimensionality com-
pletely disappears. The higher val-
ues leave little room for a third di-
mension. We can conclude that Shel-
don has evidence for no more than
two independent (not necessarily
valid) types of human physique. Ek-
man’s conclusion is thus thoroughly
substantiated.

Origin of the temperament types.
Sheldon’s procedure in establishing
the temperament types is also sub-
ject to criticism. Sheldon states that
in selecting the 20 traits used to de-
scribe each of the three types he used
a procedure similar to factor analysis.
An impartial critic would prefer the
term “cluster analysis,” and one
would add “statistically naive” as
well. Sheldon’s statistical criteria for
trait selection were as follows: cor-
relations of at least +.60 between all
traits within each cluster, and corre-
lations of at least -.30 with all traits
in the other two clusters. He states
that on a priori grounds he expected
to find four clusters and was sur-
prised to find only three. The a priori
reasons were evidently not statistical
in nature. As long as adequate num-
bers of cases were used (to avoid
gross sampling errors), his criteria for
selection made it statistically impos-
sible to obtain more than three clus-
ters of traits. Furthermore, it became 
equally certain that any two would 
approximately determine the third. 

The statistical argument here is 
simple. Since each trait in a cluster 
had to be correlated to the extent of 
at least —.30 with every trait in the 
other clusters, the mean correlations 
between single traits in separate clus- 
ters must be substantially greater
than —.30. The correlations between 
clusters will be still greater since sums 
of 20 traits are correlated. It is esti- 
mated, on the basis of the formula 
for the correlation of sums, that the 
correlations between clusters would 
approach —.50. It is statistically im- 
possible to find more than three vari- 
ables with intercorrelations in this 
neighborhood, since at this point the 
multiple correlation between any two 
and the third is unity. For four vari- 
ables, the intercorrelations would 
have to be as low as -.333, an obvious
impossibility starting with the a 
priori criteria of trait selection used 
by Sheldon. 
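
The arithmetic behind this argument can be sketched as follows. This is an illustration under stated assumptions, not the writer's own computation: every within-cluster correlation is set at exactly +.60 and every between-cluster correlation at exactly -.30 (the threshold values of the selection criteria), items are treated as standardized, and the function names are hypothetical.

```python
def correlation_of_sums(n_items, r_within, r_between):
    """Correlation between two sums of n_items standardized traits each.

    Assumes a uniform correlation r_within between traits of the same
    cluster and a uniform correlation r_between across clusters.
    Covariance of the sums: n^2 * r_between.
    Variance of each sum:   n + n * (n - 1) * r_within.
    """
    return n_items * r_between / (1 + (n_items - 1) * r_within)


def lowest_common_correlation(k):
    """Most negative value that all intercorrelations of k variables can share."""
    return -1.0 / (k - 1)


def multiple_r_squared(r_ab, r_ac, r_bc):
    """Squared multiple correlation of A on B and C (three-variable formula)."""
    return (r_ab**2 + r_ac**2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc**2)


# Two clusters of 20 traits at the threshold values of the selection criteria.
cluster_r = correlation_of_sums(20, 0.60, -0.30)

# Lower bounds on a common intercorrelation for three and for four variables.
limit_three = lowest_common_correlation(3)
limit_four = lowest_common_correlation(4)

# With three variables all intercorrelated -.50, any two determine the third.
r2_at_limit = multiple_r_squared(-0.50, -0.50, -0.50)

print(f"correlation between cluster sums: {cluster_r:.3f}")
print(f"limit for three variables: {limit_three:.3f}")
print(f"limit for four variables: {limit_four:.3f}")
print(f"squared multiple correlation at -.50: {r2_at_limit:.3f}")
```

Under these assumptions the cluster correlation falls just short of -.50, which is the limiting value for three variables; a fourth cluster would require intercorrelations no more negative than -.333.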

Correlations published by Sheldon 
(11) for a sample of 200 cases for the 
temperament types are presented in 
Table 5. Multiples, compiled by the 
writer, appear in the diagonal as be- 
fore. These are sufficiently high that 
it seems useless to go through the 
series of corrections made on the data 
for physical types. The less reliable 
nature of the temperament ratings is

TABLE 5

INTERRELATIONSHIPS OF TEMPERAMENT
TYPES

              Viscerotonia  Somatotonia  Cerebrotonia

Viscerotonia      .815         -.34         -.37
Somatotonia                     .873         -.62
Cerebrotonia                                 .875

Note. Intercorrelations of the types, taken from
Sheldon, are presented in the usual fashion. Multiple
correlations between each type and the other two are
listed in the diagonal. No corrections have been applied
to the data. N = 200.


in itself probably sufficient to account 
for the obtained values being less 
than unity. We can safely conclude 
that Sheldon has evidence for no more 
than two independent (not neces- 
sarily valid) types of temperament. 
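
The correction for unreliability that was applied to the physique data, and judged unnecessary here, can be sketched on the published values. The block below assumes the classical attenuation formula and the same reliability for every type; that assumption is inferred from the tabled numbers (for example, -.347 divided by .95 gives -.365) and is not stated in so many words in the text. Function and variable names are illustrative.

```python
import math


def disattenuate(r_xy, reliability_x, reliability_y):
    """Classical correction for attenuation: r / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


def multiple_r(r_ab, r_ac, r_bc):
    """Multiple correlation of A on B and C (three-variable formula)."""
    r_squared = (r_ab**2 + r_ac**2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc**2)
    return math.sqrt(r_squared)


# Above-diagonal correlations of Table 3 (both earlier corrections applied).
TABLE3_ABOVE_DIAGONAL = {
    "endo_meso": -0.347,
    "endo_ecto": -0.444,
    "meso_ecto": -0.613,
}


def corrected_values(reliability):
    """Return the disattenuated correlations and the endomorphy multiple."""
    corrected = {
        pair: disattenuate(r, reliability, reliability)
        for pair, r in TABLE3_ABOVE_DIAGONAL.items()
    }
    endo_multiple = multiple_r(
        corrected["endo_meso"], corrected["endo_ecto"], corrected["meso_ecto"]
    )
    return corrected, endo_multiple


# The two reliabilities assumed in the text.
results = {rel: corrected_values(rel) for rel in (0.95, 0.97)}

for rel, (corrected, endo_multiple) in results.items():
    rounded = {pair: round(r, 3) for pair, r in corrected.items()}
    print(f"reliability {rel}: {rounded}, endomorphy R = {endo_multiple:.3f}")
```

The corrected correlations match the off-diagonal entries of Table 4, and the endomorphy multiples come out close to the tabled .989 and .953, the small differences being attributable to rounding in the published inputs.
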
Physique-temperament correlations. 
The published correlations relating 
the physique and temperament types, 
while undoubtedly higher and more 
stable from the sampling point of 
view than any others in the literature, 
are basically defective. Several re- 
viewers who should have known bet- 
ter have disregarded the fact that the 
same person (Sheldon) made the rat- 
ings of both temperament and phy- 
sique. Sheldon at least recognized the 
danger in this procedure, but dis- 
counted it for two reasons: he states 
that he recognized the difficulty at 
the time the ratings were being made, 
and the ratings of temperament pre- 
ceded the ratings of physique. These 
arguments are not convincing. The 
relationships in question could be 
completely invalidated by this aspect 
of the procedure. The only legitimate 
conclusion to be drawn concerning 
these relationships is “not proven.” 
Let the reader reflect for a moment 
how the data from an analogous situ- 
ation would be received by a group 
of biologically oriented psychologists. 
A social psychologist has an hypothe-
sis concerning the effects of demo-
cratic and autocratic home atmos-
phere on the development of a cer-
tain personality trait. His research
involves a correlation between rat-
ings of homes and ratings of the per-
sonality trait, both made by himself.
A high correlation is found. The
analogy seems sufficiently close, and
the conclusion so apparent, as to need
no further comment.

Sheldon’s more recent report on
juvenile delinquency (12) also shows
evidence of inadequate control. The
conclusion that physical type is
highly related to juvenile delinquency
is based on a comparison of his de-
linquent sample with his college un-
dergraduates.

Evaluation of Sheldon's typology.
Sheldon’s claims for having estab-
lished relationships between phy-
sique and temperament are thus
“thrown out of court” for lack of evi-
dence. More basic, however, is the
doubt cast on the validity of his type
concepts. His temperament types
were arbitrarily determined by the
statistical criteria. His physical types
arose from the arm chair and were
undoubtedly influenced by the same
line of statistical reasoning. Research
workers, if they wish to make use of 
Sheldon's types, are advised to dis- 
card one physical type and the cor- 
responding temperament type. This 
would result in savings of measure- 
ment time and statistical analysis of 
data. If multiple regression analysis 
is planned, however, the recommended procedure becomes compulsory.
Beta weights can be reliably determined on only two of three
mutually dependent variables.
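The point about beta weights can be illustrated numerically. The following is a minimal sketch, assuming invented random ratings (not Sheldon's data) and an arbitrary constant sum of 12 for the three type scores; all variable names are hypothetical. It shows that the correlation matrix of three scores that add to a constant has rank two, so it cannot be inverted to yield three separate weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data only: 200 hypothetical persons rated on three components.
raw = rng.normal(size=(200, 3))

# Make the three "type" scores mutually dependent: every row sums to 12.
types = raw - raw.mean(axis=1, keepdims=True) + 4.0

# Correlation matrix of the three type scores.
R = np.corrcoef(types, rowvar=False)

# One eigenvalue is zero (up to rounding), so R cannot be inverted and
# three separate beta weights are not uniquely determined.
eigenvalues = np.linalg.eigvalsh(R)
rank = int(np.linalg.matrix_rank(R, tol=1e-8))

# Dropping any one type leaves a nonsingular 2 x 2 matrix.
rank_two_types = int(np.linalg.matrix_rank(R[:2, :2], tol=1e-8))

print("row sums constant:", np.allclose(types.sum(axis=1), 12.0))
print("rank with three types:", rank)
print("rank with two types:", rank_two_types)
```

Discarding one type, as recommended above, removes the dependence without losing any information carried by the remaining two.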

Even if the research worker in this field discards one of the three
types, he can still have no confidence in the meaningfulness of the two
retained. This is not to say that empirical relationships with the types
cannot be obtained, though Eysenck's review (5) indicates that few have
been established. The more careful investigation of factors in
delinquency by Glueck and Glueck (6) does indicate some nonchance
relationship with body build. The possibility remains, and will be
discussed in greater detail later, that a more sophisticated approach to
the problem would produce more substantial relationships in those cases
where some relationship has been shown.

THE LOGIC OF TYPE VARIABLES

With the completion of these sta- 
tistical analyses of Sheldon’s types, 
the writer became interested in the 
logic of type variables as they have 
been used historically by Kretschmer 
and others. This logic apparently ex- 
plains some of Sheldon’s mistakes, 
i.e., his interest in negative correla- 
tions. It also furnishes reasons why 
a priori types should be discarded. 
The argument here will again be 
found to parallel in part a theoretical 
development of Ekman (3). The lat- 
ter did not see, however, that his rea- 
soning applied as well to Sheldon as 
to Kretschmer. 

Definition of type. A type has tra- 
ditionally been defined in terms of an 
ideal person. A type score is the de- 
gree to which a given individual ap- 
proaches the ideal. Ideals (types) are 
defined in terms not only of the pres- 
ence of certain traits to high degree, 
but also of the virtual absence of all 
other traits. The description of a sec- 
ond ideal (type) will involve high 
scores on certain traits and low scores 
on others that entered the description 
of the first ideal (type) in opposite 
degree. Negative correlations among types naturally follow, and the
smaller the number of types deemed necessary to encompass the range of
human differences, the higher will be these negative correlations.

The scatter plots of the correlations 
among Sheldon’s types are of interest 
in this regard beyond the evidence 
for curvilinearity of regression discussed earlier. These plots take
the form of a “T” tilted to the right; i.e.,
there are no entries above the upper 
left or lower right diagonals. Two 
high type scores, e.g., 7-7-1, 7-1-7, or 
1-7-7, are impossible because two 
ideals cannot both be approximated 
in one individual. A maximum score 
for one type also assures two mini- 
mum scores for the other types, e.g., 
7-1-1, 1-7-1, or 1-1-7, since the one 
high score means that the individual 
is low on all the other traits that in 
various combinations determine the 
remaining types. 

The definition of type thus far evolved is, however, lacking in one
particular. It does not explain why three low type scores are not found,
nor why one average score is always accompanied by at least one other
average score. The definition does not account, in other words, for the
fact that the three type scores add up to a constant, as has been shown
earlier.

It seems to have been assumed by Sheldon that the definition of the
ideal (type) should be in relative terms. Approximation to the ideal
mesomorph, for example, does not depend on absolute height or weight,
but is a function of relative bodily proportions. Defined in this way
the physique of every individual is completely described by the types
selected. By definition there are no 1-1-1 individuals. There is nothing
remarkable in the fact that certain combinations of scores do not occur
in nature as Sheldon implies. Certain combinations are prohibited by the
nature of the concepts selected to describe human physique or
temperament.

Examples of types. It may be use- 
ful at this point to give several ex- 
amples of the operation of type con- 
cepts. These illustrations will serve 
to clarify further the definition of 
type. They will also serve as the 
main argument concerning the arbi- 
trariness of number of types used by 
any one theorist. 

Let us suppose that there are 
twelve observable human mental 
abilities. It would be possible for a 
type theorist in viewing this particu- 
lar range of human differences to 
establish arbitrarily only two ability 
types. One could, for example, speak 
of the intellectual and mechanical 
types. The first would be defined by 
the presence of high scores on about 
half of the 12 abilities, low scores on 
all of the rest. High and low are of course defined relative to the
person's own mean. Any exception to this pattern would reduce the size
of the type score, i.e., the perfect intellectual type is low in any
trait required for mechanical occupations. The second
type would be defined by the oppo- 
site combination of abilities; i.e., low 
on the group where intellectuals are 
high, high on all of the rest. The re- 
sulting correlation between the two 
types would be —1.00. Note that 
every individual can now be placed 
at some point along each of these 
type scales, but that knowledge of 
one determines the other, and that 
as long as low and high are rated rela- 
tive to the individual's own profile of 
abilities, the sum of the two type 
scores will be a constant for all in- 
dividuals. 

Three ability types could equally 
well be established. One type could 
be defined in terms of high scores on 
about a third of the abilities, low on 
the rest. The other two types would
be defined in a similar manner. In
this case the correlations between the 
three types will be of the magnitude 
of —.50. Again, all individuals can 
be placed at some point along each of 
these scales, with any two defining 
the third. Three types now encom- 
pass the entire range of human abil- 
ities. 

This process could obviously be 
continued until 12 types were defined. 
We would still find negative correla- 
tions between types, averaging about 
—.09 for this number of types. 
Twelve types would also make possi- 
ble a larger number of combinations 
of type scores, including several 
fairly high scores for any one indi- 
vidual. We would still find, however, 
that the top possible score for one 
type would force the other scores to 
minimum levels. In general, the same 
biasing factors would be present, but their force would be somewhat
dissipated by the larger number of degrees of freedom available. Note
that, no matter what the correlations might be between the trait
measures (from past experience we know that there would be a range of
positive values), the correlations among the type scores would
necessarily be negative because of the way in which types are defined.
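The magnitudes quoted above for two, three, and twelve types (-1.00, -.50, and about -.09) follow from a short argument. The sketch below rests on a simplifying assumption not stated in the text, namely that the k type scores have equal variances; the symbols T_i, C, sigma^2, and r-bar are introduced only for this illustration.

```latex
% Assumption: k type scores T_1, ..., T_k with a common variance \sigma^2
% and a constant sum C for every individual.
\begin{align*}
  \sum_{i=1}^{k} T_i = C
  \;&\Longrightarrow\;
  0 = \operatorname{Var}\Bigl(\sum_{i=1}^{k} T_i\Bigr)
    = k\,\sigma^2 + k(k-1)\,\bar{r}\,\sigma^2 \\
  \;&\Longrightarrow\;
  \bar{r} = -\frac{1}{k-1}, \\[4pt]
  k = 2:\ \bar{r} = -1.00, \qquad
  k = 3:&\ \bar{r} = -.50, \qquad
  k = 12:\ \bar{r} = -\tfrac{1}{11} \approx -.09 .
\end{align*}
```

With unequal variances the individual correlations differ from one another, but the sum of all the covariances is still forced to equal minus the sum of the variances, so the covariances remain negative on the average.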

It will be remembered that Sheldon 
strove for negative correlations among 
his types of temperament and that he 
was pleased with negative correla- 
tions among his types of physique. It 
is now seen that he was following the 
logic of the traditional type concept. 
Seeking high negative correlations 
automatically produces a small num- 
ber of types. Using relative standards 
ensures that everyone will have a 
high score some place. These characteristics, combined with a presumed
high degree of generality in explaining human behavior (as a matter of
fact the presence of a pigeonhole for everyone is frequently assumed to
constitute evidence for the generality), make type concepts well nigh
irresistible for the clinically oriented person. Although Sheldon's
predecessors did not quantify their types, their theories had basically
these same characteristics.


SUBSTITUTES FOR TYPES 


At the point where there are as 
many types as there are measured 
traits, types become simply ipsative 
(1) or relative (2) scales. Score val- 
ues are obtained with reference to the 
person's own mean. Another com- 
mon way of stating the same thing is 
that level has been removed from the 
profile. Sheldon’s types could also be 
characterized as ipsative scales, al- 
though they differ from the usual 
scales of this sort in their complexity, 
i.e., a single type is defined by many 
facets of the person. Otherwise the 
parallel is complete. Intercorrela- 
tions among any number of ipsative 
scales will tend to be negative as long 
as these scales are obtained from a 
common score matrix. The size of the 
negative correlations will be a func- 
tion of the number of ipsative scales. 
No one can be good on everything on 
such scales. Everyone will be high in 
something, low in something else. All 
persons’ scores will add to a constant 
which will be equal to the amount, 
added to a standard score of zero, 
found necessary to avoid negative 
scores on any one scale. 
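These properties of ipsative scales can be seen in a small simulation. The following is a minimal sketch on invented data; the helper names `ipsatize` and `mean_offdiag`, the choice of four scales, and the added constant of 5 are assumptions made only for illustration.

```python
import numpy as np

def ipsatize(scores, constant=0.0):
    """Express each score relative to the person's own mean, plus a constant."""
    scores = np.asarray(scores, dtype=float)
    return scores - scores.mean(axis=1, keepdims=True) + constant

def mean_offdiag(matrix):
    """Average of the off-diagonal entries of a square matrix."""
    k = matrix.shape[0]
    return (matrix.sum() - np.trace(matrix)) / (k * (k - 1))

rng = np.random.default_rng(1)
k = 4                                    # number of scales (illustrative)
general = rng.normal(size=(500, 1))      # shared "level" component
normative = general + rng.normal(size=(500, k))   # positively correlated traits

# Remove level from each profile; 5.0 is added to every scale to avoid
# negative scores, so each person's scores sum to k * 5.0.
ipsative = ipsatize(normative, constant=5.0)

r_norm = mean_offdiag(np.corrcoef(normative, rowvar=False))
r_ips = mean_offdiag(np.corrcoef(ipsative, rowvar=False))

print("row sums:", np.unique(np.round(ipsative.sum(axis=1), 8)))
print("mean normative r:", round(r_norm, 3))
print("mean ipsative r:", round(r_ips, 3))
```

The normative scales in this sketch are positively intercorrelated, yet the ipsative scales derived from the same score matrix are negatively intercorrelated on the average, and every person's scores add to the same total.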

Traits, in contrast to types, have 
historically been associated with nor- 
mative scales. The contrast between 
the functional characteristics of traits 
and types, or normative and ipsative 
scales, is marked. A proposed trait 
measure may have correlations with 
other normative scales ranging from 
plus to minus unity. The average
intercorrelation of a group of trait
measures may be anything in the 
same range. As long as correlations 
with other measures are not unity, all 
possible combinations of trait scores 
will appear. Relative frequencies of 
such combinations will vary, of 
course, but only as a function of the 
correlations between the scales and the 
shapes of the marginal distributions. 
The contribution to variance of across- 
the-board differences between indi- 
viduals can be large or small relative 
to the differences within individuals. 
No matter how low the intercorrela- 
tions of trait measures are, however, 
there will be some few persons low on everything, others high on
everything. The statement found in many elementary texts, “correlation
not compensation,” holds for traits; but the opposite statement,
“compensation not correlation,” holds for types.

Another difference between traits 
and types is that the trend in trait 
measurement has been toward more 
specificity. Thus the complex trait 
measure of general intelligence has 
been giving way to factor measures of 
separate aptitudes. If type-like scales 
are desired for research purposes, it 
would be useful to give up complex 
types such as those of Sheldon and 
use specific ipsative scales in sufficient 
number to cover the area of interest. 

Choice of scale. For prediction purposes an investigator must choose the
type of scale which is fitted for the problem at hand. In most cases
this will be a normative scale. Most proficiency criteria, for example,
are themselves normative. It is highly doubtful that Sheldon's types, or
other more carefully selected ipsative scales, can predict athletic
achievement as well as normative scales of physique. One might be able
to define the line-backer type, but if a given example weighed 120
pounds he probably would not be suitable material for the college team.
In the same way, normative aptitude scales are better predictors of
academic achievement than ipsative scales. The writer is not at all
certain that there are occasions when an ipsative scale would be
preferred. This possibility should not be ruled out, however, without
thorough exploration. A good bet for the tryout of ipsative scales is in
the prediction of decisions. Such criteria would seem to result from the
balancing of tendencies (traits) within the individual, not from his
standing in a group on the several traits.*
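The footnote to this paragraph reports a result attributed to Clemens: the normative scales always give a multiple correlation with an outside variable at least as high as the ipsative scales derived from them. The following is a minimal sketch of why this holds, using invented data and a hypothetical criterion; the function name `r_squared` is an assumption for illustration. Ipsative scores are linear combinations of the normative scores, so anything they can predict by least squares the normative scores can predict as well.

```python
import numpy as np

def r_squared(X, y):
    """Squared multiple correlation of y with the columns of X (least squares)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

rng = np.random.default_rng(2)
normative = rng.normal(size=(300, 4)) + rng.normal(size=(300, 1))
# Ipsative scores: each person's scores taken about that person's own mean.
ipsative = normative - normative.mean(axis=1, keepdims=True)

# Hypothetical outside criterion that depends partly on overall level.
criterion = normative.sum(axis=1) + rng.normal(size=300)

r2_normative = r_squared(normative, criterion)
r2_ipsative = r_squared(ipsative, criterion)
print(round(r2_normative, 3), round(r2_ipsative, 3))
```

Because the criterion in this sketch depends on overall level, which ipsative scoring removes, the gap between the two values is large; with a criterion unrelated to level the two would be closer, but the normative value would still not be lower.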

* It should be noted that tryout of ipsative scales has value only for
theoretical purposes. William V. Clemens has shown in an unpublished
report from the University of Washington (October 1956) that the group
of normative scales from which the ipsative scales are developed will
always give a multiple correlation with an outside variable as high or
higher than the latter. Thus from the point of view of prediction,
ipsative scales are unnecessary. Appropriate combinations of positive
and negative weights of the normative scales can always accomplish the
same result.

It was mentioned earlier that Glueck and Glueck had found a nonchance
relationship between somatotype and delinquency. It is possible that
this criterion is also basically ipsative. But whatever the nature of
the criterion, it is highly probable that greater differentiation could
have been accomplished by an empirical combining of several specific
measures. The discriminant function could be applied to either ipsative
or normative scales, or a combination of the two. The choice between
scales would be made empirically in terms of the differentiation
obtained.

The multiple discriminant function. As a matter of fact, the
discriminant function, or better, the multiple discriminant function
(13, 14), is a logical technique to substitute for typing procedures. Is
person A most like the average artist, salesman, physician, engineer, or
lawyer? Is person B most like runners, jumpers, or shot-putters? Is
person C most like a brain injury case, a schizophrenic case, or a
psychopathic deviate? Although a given discriminant may have
characteristics reminiscent of types, there is a basic difference: it is
formed to answer a specific problem. One would rarely if ever wish to
use a discriminant successful in one area of research for another
problem. Trying out complex types in each new problem area is equally
unjustified and is equally to be discouraged.
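The kind of question posed above ("Is person B most like runners, jumpers, or shot-putters?") can be sketched in a few lines. The example below is an illustration under stated assumptions, not the multiple discriminant function of the cited references: it assigns a person to the group whose centroid is nearest in Mahalanobis distance under a pooled within-group covariance, which corresponds to linear discriminant classification with equal priors. The data, group centers, and function names are invented.

```python
import numpy as np

def fit_groups(X, labels):
    """Return group centroids and the inverse pooled within-group covariance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    means = {g: X[labels == g].mean(axis=0) for g in groups}
    centered = np.vstack([X[labels == g] - means[g] for g in groups])
    pooled = centered.T @ centered / (len(X) - len(groups))
    return means, np.linalg.inv(pooled)

def most_like(x, means, pooled_inv):
    """Name the group whose centroid is nearest in Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    d2 = {g: float((x - m) @ pooled_inv @ (x - m)) for g, m in means.items()}
    return min(d2, key=d2.get)

rng = np.random.default_rng(3)
# Hypothetical group centers on two normative measures.
centers = {"runner": [0.0, 0.0], "jumper": [5.0, 0.0], "shot_putter": [0.0, 5.0]}
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers.values()])
labels = [name for name in centers for _ in range(50)]

means, pooled_inv = fit_groups(X, labels)
print(most_like([4.6, 0.3], means, pooled_inv))
```

The weights are derived from the groups at hand, so the resulting rule answers this one question only; a new problem calls for a new fit, which is the contrast with a priori types drawn in the text.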

Steps in scale development. It is 
clear to the writer that methodologi- 
cal research on scale development 
should proceed in accordance with a 
logical order, which applies to phy- 
sique as well as to interest, tempera- 
ment, and aptitude. 

1. The first priority should be 
given to the development of ade- 
quate normative scales. These scales 
should be fairly specific and relatively 
homogeneous, though the scalability 
criterion of homogeneity should not 
be generally applied. Requiring too 
high a degree of homogeneity results 
in too many scales giving too little 
useful differential information. The 
direct method of factor analysis may 
be a useful tool in the development of 
the required normative scales. 

2. Ipsative scales should usually 
follow the normative. For one thing 
this would allow a rational choice of 
a normative group of scales for de- 
velopment of the ipsative scoring. 
Mainly, however, one must know the 
characteristics of the normative scale 
before the ipsative scale can be given 
meaning. A possible exception to this order of development is in the use
of the paired comparison item format, or one of its derivatives, for
interest and personality measurement. An initial normative scale may not
be essential for an adequate ipsative scale in these areas, though we
should not rest content to use paired comparison scales in place of
adequate normative scales in situations requiring the latter.

3. With separate scales well in hand, it is appropriate to consider
combinations for the prediction of assorted criteria. It also seems
clear that this should be done both empirically and statistically; i.e.,
the combination should be for a particular purpose, it should follow,
not precede, measurement, and it should be computed by an appropriate
formula. These criteria rule out use of Sheldon's types. The last rules
out clinical methods of combining test information.

4. Consideration of statistical methods of combining should not be
limited to the multiple regression procedure. The applicability of the
discriminant function to problems associated in the past with types has
already been described. Other prediction problems may yield to pattern
analysis techniques (8, 9). Note that these techniques are not typing
procedures as used in this discussion. Pattern analysis is here
considered as another way of combining data to solve a particular
problem.


SUMMARY AND CONCLUSIONS 


Sheldon's physical and temperamental types, and their joint
relationship, have been critically examined. A number of limitations of
his research are apparent. The physical types originated in the arm
chair. Measurement entered later as a means of differentiating
objectively the subjectively determined types. It was also shown that
the choice of types to describe human physique and temperament
automatically restricts the data in predictable ways. Of necessity, type
interrelationships are negative, and certain combinations of scores are
prohibited by the nature of the concepts. As a result of the expected
mutual dependence of types, there is evidence in Sheldon's data for no
more than two independent types, either of physique or temperament. The
research worker, if he uses these types, is therefore advised to discard
one type of physique and its temperament counterpart in his
investigations. Finally, the correlations relating physique to
temperament are invalidated by the fact that the same judge (Sheldon)
was responsible for both sets of ratings.

With respect to type concepts generally, it was suggested that types
have traditionally been defined as mutually exclusive ideals. Thus, two
types can never be represented in high degree in one person.
Furthermore, types have been defined by relative measures so that no one
is low in everything; i.e., a pigeonhole is provided for everyone. This
tends to give type concepts a spurious degree of attractiveness. The
size of the complex involved in the type is arbitrary, however, so that
the number of types can vary from two up to the number of discriminable
traits. Each such set of types has the same characteristics, but the
average level of negative correlations decreases as the number of types
increases. An increasing number of types also allows more degrees of
freedom for various combinations of type scores, but certain
combinations will still be prohibited in even large numbers of types.


When the number of types is equal to the number of traits, a type
becomes an ipsative scale. Traits, on the other hand, are normatively
scaled. Normative scales, in contrast to ipsative scales, do not bias
intercorrelations and allow all possible score combinations. Traits are
recommended for most predictive purposes, since most criteria are
themselves normative. Specific ipsative scales may be useful for certain
problems, though this question is largely unexplored.

In place of a priori complex types, the use of the multiple discriminant
function is recommended for problems traditionally associated with
typing. Discriminants may have properties similar to types, but always
differ in one important particular: they are computed to solve a
particular problem. Still other problems traditionally associated with
type concepts may yield to pattern analysis techniques.

For those investigators interested 
in problems of physique, it is recom- 
mended that they start with the trait 
approach and within reason exhaust 
its possibilities. One might wish sub- 
sequently to explore ipsative scoring 
of the separate traits. Finally, for 
specific prediction problems, various 
possible mathematical combinations 
of either normative or ipsative scales 
should be tried.

If an investigator should invent a
new psychological test and then turn 
to any recent scholarly work for 
guidance on how to determine its re- 
liability (e.g., 6), he would confront 
such an array of different formula- 
tions that he would be unsure about 
how to proceed. After fifty years of 
psychological testing, the problem of 
discovering the degree to which an 
objective measure of behavior re- 
liably differentiates individuals is still 
confused. 

This confusion stems from a rigid adherence to unobjective and
unrealistic postulates about the nature of measurement, assumptions
originally invented by Spearman and William Brown a half century ago. We
will review the unsuccessful efforts of 
psychologists over fifty years to free 
themselves from these restrictive 
orthodoxies. On the positive side, we 
will conceptualize and reformulate 
the problem in terms of the realities 
of objective measurement. Once an 
analyst has assessed the structure of 
a test, he can in most cases calculate 
the value of its reliability coefficient 
from the statistical constants of the 
test-samples (items) that compose it. 
The welter of different “methods” of
calculating the reliability coefficient 
commonly employed are either dif- 
ferent computational forms that yield 
the same correct value, or they refer 
to various empirical designs devised 
to estimate this value. A parallel 
confusion exists over determining the 
communality and cluster domain 
validity, treated elsewhere by the 
writer (26). 


THE OBJECTIVE OPERATIONS OF
MEASURING INDIVIDUAL DIFFERENCES
IN ANY BEHAVIOR


The behavior analyst's first step in constructing a “test” is to
conceptualize some property, X, of a group of individuals. If the
individuals are men, X may be an ability like vocabulary knowledge, or
some personality characteristic like rigidity.
If they are rats, the property X may 
be maze learning. If Drosophila, X 
may be the geotropic reaction. 

The second step is to define the property X in terms of objective
specifications that directly lead to the taking of test-sample
observations, $X_1, X_2, \ldots, X_n$, believed to elicit the defined
behavior, X. These test-samples may be vocabulary items, ratings,
entrances into blind alleys, vertical movements in a test tube.

The third step is to compute for each individual its composite total
score, $X_t$, which is the sum of $X_1, X_2, \ldots, X_n$, i.e.,

$$X_t = X_1 + X_2 + \cdots + X_n. \qquad [1]$$

The analyst needs to know the within individual variance, or “error of
measurement,” of the total score, $X_t$. This individual variance would
be the variability of the individual in many scores comparable to the
observed composite, $X_t$. Any one of such comparable scores, say
$X_t'$, would also be composed of the sum of n test-samples, thus,

$$X_t' = X_1' + X_2' + \cdots + X_n', \qquad [2]$$

in which the primed test-samples are conceptualized as being of the same
kind as those in the observed composite, $X_t$. The correlation between
the observed $X_t$ and the comparable construct, $X_t'$, is called the
reliability coefficient, $r_{tt}$, of the observed composite. From the
reliability coefficient the individual variance can, as we shall see
later, be estimated.
Current practice in computing $r_{tt}$ is variable. Some analysts insist
that $X_t'$ be an actual “comparable form” to $X_t$. Others prefer the
“split-half” method, with Spearman-Brown correction for double length.
Still others may compute $r_{tt}$ from the variances of the observed
test-samples (e.g., by the Kuder-Richardson formula). Some may even take
$r_{tt}$ to be the “test-retest” correlation.
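As an illustration of one of the practices just listed, the split-half method with Spearman-Brown correction can be sketched as follows. This is a minimal example on invented data; the odd-even split, the function names, and the simulated scores are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def spearman_brown(r, factor=2.0):
    """Project a correlation between part-scores to a test `factor` times as long."""
    return factor * r / (1.0 + (factor - 1.0) * r)

def split_half_reliability(scores):
    """Correlate odd- and even-numbered test-samples, then step up to full length."""
    scores = np.asarray(scores, dtype=float)
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return spearman_brown(r_half)

rng = np.random.default_rng(4)
# Hypothetical data: 100 persons, 6 test-samples sharing a common component.
scores = rng.normal(size=(100, 1)) + rng.normal(size=(100, 6))

print(round(split_half_reliability(scores), 3))
print(round(spearman_brown(0.5), 4))
```

A different split of the same test-samples would generally give a somewhat different value, which is one reason the several practices listed above do not always agree in application.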


PREVAILING ASSUMPTIONS ABOUT 
MEASURES OF INDIVIDUAL 
DIFFERENCES 


Virtually all writers assert that the use of these important
formulations follows from certain assumptions about the n test-samples
that make up the observed composite, $X_t$. We shall examine these
assumptions below, whence it will be obvious that rarely can it be shown
objectively that test-samples do satisfy the requirements. This paper
shows that
such assumptions are not needed. In 
the last section of the paper we will 
see that despite the irrelevance and 
immateriality of these assumptions 
they have continued to govern most 
of the thinking about this problem 
since the inception many years ago of 
the Spearman-Yule theory of true 
and error factors on the one hand, 
and of the Brown-Kelley theory of 
statistically equivalent test-samples 
on the other. 

The Spearman-Yule theory of true and error factors. We shall merely
state this postulate here, later in the paper surveying its history.
Spearman presented it in principle in 1910 (18), Yule participating to
the extent of communicating a precise formulation of it to Spearman in
1908 (27, Ch. 11, ref. 7). And 44 years later Guilford accepts it as the
“rationale” of psychological testing in his 1954 Psychometric Methods
(6).

In brief, the theory asserts that for scores on any two raw
test-samples, $X_i$ and $X_j$, each deviation score, $x_i$ and $x_j$, of
an individual, is determined by an “underlying” true factor,
$x_\infty$, plus an “error” factor, e. The error factors of $x_i$ and
$x_j$ are postulated as being uncorrelated with $x_\infty$, and with
each other. In short,

$$x_i = x_\infty + e_i, \qquad x_j = x_\infty + e_j \qquad [3]$$

$$r_{e_i x_\infty} = r_{e_j x_\infty} = r_{e_i e_j} = 0. \qquad [4]$$


Spearman thought of $x_\infty$ as “g,” a general factor running through
all cognitive abilities, but in the eyes of modern factor analysts “g”
is usually replaced by a composite of more than one common factor plus a
factor specific to each test-sample. The true factor, $x_\infty$, is
released from its “underlying” status by some writers who conceive it to
be the mean of many test-sample scores, but even in this conception the
errors are postulated as being uncorrelated. The errors are furthermore
postulated as operating equally in the test-samples, the net effect
being that the test-samples reveal equal variances and equal inter-r's.
All the standard formulas for computing the reliability of a test can be
derived from these postulates (7, Ch. 2; 15). So derived, however, these
formulas would be restricted in use only to those measures for which it
could be demonstrated that these postulates have substantive support, an
unattractive requirement for formulas so generally employed.

The Brown-Kelley theory of statistically equivalent test-samples. The
other conception, presented by William Brown (1, 2) also in 1910, later
systematically formulated by Truman Kelley (13), is the basic doctrine
generally accepted 40 years later by Gulliksen in his 1950 Theory of
Mental Tests (7, Ch. 3 ff.). It postulates that all the n test-samples
in $X_t$ must have equal standard deviations, and equal
intercorrelations, i.e.,

$$\sigma_i = \sigma_j = \sigma, \text{ a constant} \qquad [5]$$

$$r_{ij} = r, \text{ a constant.} \qquad [6]$$

In short, this conception ignores “underlying factors” but accepts the
equivalence of test-samples. All the standard formulas for calculating
the reliability and domain validity of a composite test can also be
derived on these assumptions (7, Ch. 3; 13).
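As an example of such a derivation, the Spearman-Brown form of the reliability coefficient can be obtained from postulates [5] and [6] in a few lines. The sketch adds one assumption that is not spelled out in the text: the test-samples of the comparable composite have the same standard deviation, and correlate with those of the observed composite to the same constant degree r.

```latex
% Assumptions: [5] \sigma_i = \sigma and [6] r_{ij} = r for all test-samples,
% including the cross-correlations between samples of X_t and of X_t'.
\begin{align*}
  \operatorname{Var}(X_t) = \operatorname{Var}(X_t')
    &= n\sigma^2 + n(n-1)\,r\,\sigma^2
     = n\sigma^2\bigl[1 + (n-1)r\bigr], \\
  \operatorname{Cov}(X_t, X_t')
    &= \sum_{i=1}^{n}\sum_{j=1}^{n} r\,\sigma^2 = n^2 r\,\sigma^2, \\
  r_{tt}
    = \frac{\operatorname{Cov}(X_t, X_t')}
           {\sqrt{\operatorname{Var}(X_t)\operatorname{Var}(X_t')}}
    &= \frac{n^2 r\,\sigma^2}{n\sigma^2\bigl[1 + (n-1)r\bigr]}
     = \frac{n\,r}{1 + (n-1)\,r}.
\end{align*}
```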

Were we to restrict the use of these formulas only to tests whose
test-samples met these conditions of strict equality of σs and inter-rs
the formulas would obviously not be applicable to most of the commonest
situations, such as those in which the test-samples are true-false items
with differing proportions of true responses.


OBJECTIVE PRINCIPLES OF 
DOMAIN SAMPLING 


No restrictive postulates or assumptions about the observed test-samples
are in fact required in the development and use of the standard
formulations of reliable individual measurement. The standard formulas
follow directly from the operations employed in objectively sampling
behavior. Postulates of “underlying factors” are superfluous, and
test-samples may have different variances and covariances.

In computing the reliability coefficient of the total score, $X_t$, the
analyst is seeking an answer to this question: What is the value of the
correlation, $r_{tt}$, between the observed $X_t$ scores and a second
set of composite scores, $X_t'$, earned on a “comparable form” of the
$X_t$ composite?

Comparable form, $X_t'$, as a construct. We must define comparability of
the $X_t'$ composite in terms of the realities of the observed $X_t$
composite, as follows: A comparable $X_t'$ composite is one whose n
test-samples vary on the average as much in σs and inter-rs as do the n
test-samples in the observed $X_t$ composite. Analysts may not
anticipate actually setting up such a comparable second composite in
order to calculate $r_{tt}$. Indeed, it may not be feasible to do so.
Furthermore, with the exception of certain “stratified composites” to be
discussed later, it is unnecessary to do so, for the second $X_t'$
composite is a construct whose average statistical properties are by
definition those of the observed $X_t$ composite at hand. This construct
$X_t'$ composite is in fact a criterion by which to determine the degree
to which an actual second composite is comparable. If such an actual
second composite reveals the same average properties as the first, then
it is comparable to the first by definition; if its properties deviate
from the first, then it is not comparable.

To make the matter concrete, look at the data in Table 1. In the score
matrix at the top left you see five observed test-sample scores of 10
actual individuals. Individual 1, for example, has the five $X_i$
scores, 6, 2, 1, 0, 0, which add up to a composite $X_t$ score of 9. The
problem is to calculate the reliability coefficient of the set of $X_t$
scores of the 10 individuals. To do so, we note the following average
statistical properties of these observed $X_t$ scores:

1. They are the addition of n (= 5)
test-sample scores.

TABLE 1

ILLUSTRATIVE SCORE MATRIX, AND THE RELIABILITY COEFFICIENT, $r_{tt}$,
CALCULATED BY FOUR ALTERNATIVE COMPUTING FORMS

Score matrix: N = 10, n = 5

[The individual entries of the 10 x 5 score matrix are not legible in
this copy; the recoverable summary statistics and computing forms
follow. Entries shown as … are illegible.]

Statistic           X_1      X_2      X_3      X_4      X_5      X_t      Sum over i
M                   6.90     6.30     4.30     3.80     3.20     24.50
M^2                47.61    39.69    18.49    14.44    10.24    600.25    130.47
V = σ^2             4.29     9.81    12.21     7.96     7.16    140.05     41.43
Σ X_i X_t           1804     1892     1427     1245     1035
r_it                .463     .940     .903     .941     .793
σ_i r_it            .959    2.945    3.156    2.653    2.121               11.834

Mean within-individual variance: $\bar{V}_w = 47.68/10 = 4.768$
Mean test-sample variance: $\bar{V}_i = 41.43/5 = 8.286$

Individual Variance Form:

$$r_{tt} = 1 - \frac{n^2\bar{V}_w + M_t^2 - n\sum M_i^2}{(n-1)V_t} = 1 - \frac{25(4.768) + 600.25 - 5(130.47)}{4(140.05)} = 1 - .120 = .880$$

Variance Form:

$$r_{tt} = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_t}\right) = \frac{5}{4}\left(1 - \frac{41.43}{140.05}\right) = .880$$

Part-Whole Form:

$$r_{tt} = \frac{n}{n-1}\left[1 - \frac{\sum V_i}{\left(\sum \sigma_i r_{it}\right)^2}\right] = \frac{5}{4}\left(1 - \frac{41.43}{(11.834)^2}\right) = .880$$

Covariance (and its Approx.) Form (variance-covariance matrix)

Correlations, r_ij:
         X_1      X_2      X_3      X_4      X_5
X_1     1.000     .313     .377     .270     .130
X_2      .313    1.000     .771     .923     .756
X_3      .377     .771    1.000     .848     .593
X_4      .270     .923     .848    1.000     .707
X_5      .130     .756     .593     .707    1.000

Variances and covariances:
         X_1      X_2      X_3      X_4      X_5
X_1      4.29     2.03      …       1.58      .72
X_2      2.03     9.81      …       8.16     6.34
X_3       …        …      12.21     8.36     5.54
X_4      1.58     8.16     8.36     7.96     5.34
X_5       .72     6.34     5.54     5.34     7.16

Covariance Form:

$$r_{tt} = \frac{n\bar{c}_{ij}}{\bar{V}_i + (n-1)\bar{c}_{ij}} = \frac{5(4.931)}{8.286 + 4(4.931)} = .880$$

Covariance Approx. (Spearman-Brown) Form:

$$r_{tt} \approx \frac{n\bar{r}_{ij}}{1 + (n-1)\bar{r}_{ij}} = \frac{5(.5695)}{1 + 4(.5695)}$$

2. The mean variance, $\bar{V}_i$, of the 5
constituent test-samples is V,=¢,? 
( =8.3, see 4th line from the bottom). 

3. The mean covariance, ¢;;, between 
the 5 test-samples is é;=o;74; 
(=4.9, see 3rd line from the bottom). 

To compute the reliability coefficient, $r_{tt}$, we conceptualize a second composite, $X_t'$, whose constituent test-sample scores of the 10 individuals may take any values subject only to the following defined conditions:

1. There be n (=5) of them.

2. Their mean variance, $\bar{V}_{i'}$, equal that in the observed matrix, that is, $\bar{V}_{i'} = \bar{V}_i$ (=8.3).

3. Their mean covariance, $\bar{c}_{i'j'}$, equal that in the observed score matrix, i.e., $\bar{c}_{i'j'} = \bar{c}_{ij}$ (=4.9).

4. The mean cross covariances between the test-samples of $X_t$ and those of $X_t'$ preserve certain relations to one another depending on the structure of $X_t$, whether its test-samples are unstratified or stratified:

Unstratified composites: If the observed test-samples are not ordered or grouped in any known way but are as if drawn at random from a large pool of test-samples, then by definition the test-samples of the construct composite, $X_t'$, must be similarly composed, hence the mean cross covariance, $\bar{c}_{ij'}$, would equal the observed mean covariance, $\bar{c}_{ij}$.

Stratified composites: If the observed test-samples are, however, ordered or grouped by known strata, then by definition so must be those of the comparable construct. We will consider this type of structure in a later section.


UNSTRATIFIED COMPOSITES 
AND DOMAINS 

Under the definition of an unstratified comparable construct, $X_t'$, we can compute exactly the reliability coefficient, $r_{tt}$, from the observed constants of $X_t$ and without further restrictive conditions.

Let us now list, generally, the defined statistical properties of a comparable construct composite, $X_t'$, in terms of the observed values of $X_t$:

$$n' = n \qquad [7]$$
(Equality of number of test-samples),

$$\bar{V}_{i'} = \bar{V}_i \qquad [8]$$
(Equality of mean variance),

$$\bar{c}_{i'j'} = \bar{c}_{ij'} = \bar{c}_{ij} \qquad [9]$$
(Equality of mean covariances).

It follows from these definitions that the variance, $V_{t'}$, of the second composite equals the observed $V_t$ since in the formula for the variance of a sum (5, p. 586) their parallel terms are equal by [7], [8], and [9], i.e.,

$$V_{t'} = n'\bar{V}_{i'} + n'(n'-1)\bar{c}_{i'j'} = n\bar{V}_i + n(n-1)\bar{c}_{ij} = V_t. \qquad [10]$$


General Form for $r_{tt}$. The correlation, $r_{tt}$, is simply the Pearson r between the sum, $X_t$, defined by [1], and the sum, $X_t'$, by [2]. By the formula for the correlation between sums (5, p. 597)

$$r_{tt} = \frac{(1/N)\sum (x_1 + x_2 + \cdots + x_n)(x_1' + x_2' + \cdots + x_n')}{\sigma_t \sigma_{t'}}.$$

The numerator consists of the cross covariances, $n^2$ in number, between the test-samples of $X_t$ and $X_t'$, and hence reduces to $n^2\bar{c}_{ij}$ by [9]. The $\sigma$s in the denominator are equal from [10]. The general formula reduces, then, to

$$r_{tt} = \frac{n^2\bar{c}_{ij}}{V_t} \qquad [11]$$

(General Form of the reliability of an unstratified composite).
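
As a rough numerical check of [10] and [11], the sketch below plugs in the constants read from Table 1 (n = 5, mean variance 8.286, mean covariance 4.931, composite variance 140.05). The function names are illustrative only and are not part of the original paper.

```python
def variance_of_sum(n: int, mean_var: float, mean_cov: float) -> float:
    """Expression [10]: V_t = n * mean variance + n(n-1) * mean covariance."""
    return n * mean_var + n * (n - 1) * mean_cov


def general_form(n: int, mean_cov: float, total_var: float) -> float:
    """General Form [11]: r_tt = n^2 * mean covariance / V_t."""
    return n ** 2 * mean_cov / total_var


# Constants taken from Table 1.
v_t = variance_of_sum(5, 8.286, 4.931)
r_tt = general_form(5, 4.931, 140.05)
print(f"{v_t:.2f}")   # 140.05
print(f"{r_tt:.3f}")  # 0.880
```

The first line reproduces the observed composite variance from the mean variance and mean covariance alone, which is the fact that lets [11] be evaluated without a second composite.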

Alternative computing formulas for $r_{tt}$. To calculate this one value one may, however, use any one of four computing formulas which differ not in the answer they give but in certain constants of the score matrix which the analyst may prefer to use. These computing forms are:

The Variance Form, variously called Alpha, or $L_3$, or for dichotomous variables the Kuder-Richardson (or K-R) formula 20.

The Part-Whole Form, a special case of which is called "Gulliksen's formula."

The Individual Variance Form, not reported elsewhere to the writer's knowledge.

The Covariance Form, an approximation to which is known as the Spearman-Brown formula.

Confusion about these computing forms has been due to the fact that different writers have derived special cases or approximations of them. In their general forms they are identities, as shown in Table 1 where they all have the same value of $r_{tt}$ = .880. Further confusion has arisen because different writers have derived them on the basis of different assumptions or restrictions, thus leaving their readers in doubt about their application to real data. We shall see that no conditions other than the definitions given above in [7], [8], and [9] are necessary. The four computing forms evaluate the general form, [11], by substituting in it terms easier to compute. Let us examine them in detail below.

The Variance Form (Alpha, $L_3$, and the K-R special case). The simplest way to evaluate [11] is to solve for $\bar{c}_{ij}$ in [10], and substitute its equivalent in [11], whence

$$r_{tt} = \frac{n^2}{V_t}\left[\frac{V_t - \sum V_i}{n(n-1)}\right] = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_t}\right) \qquad [12]$$

(Variance Form: Reliability from variances of test-samples and $X_t$).

The computations are illustrated in Table 1 under "Variance Form" where the sequence of desk calculator operations for M, $M^2$, and V both of the test-samples and of total $X_t$ scores is shown. Substituting the appropriate values in the Variance Form at the right gives $r_{tt}$ = .880.

The Variance Form is called Alpha by Cronbach (4), $L_3$ by Guttman (8), and the Kuder-Richardson Formula 20 (16) for the special case of dichotomous items.
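
A minimal sketch of the Variance Form [12] follows. The 6 x 4 score matrix is hypothetical (the individual rows of Table 1 are not used here), variances are taken with divisor N as in Table 1, and the function name is an assumption made for illustration.

```python
from statistics import pvariance


def variance_form(scores):
    """Variance Form [12]: rows are individuals, columns are test-samples."""
    n = len(scores[0])
    sum_item_var = sum(pvariance(col) for col in zip(*scores))
    total_var = pvariance([sum(row) for row in scores])
    return n / (n - 1) * (1 - sum_item_var / total_var)


# Hypothetical score matrix, not the data of Table 1.
scores = [
    [6, 2, 1, 0],
    [7, 5, 4, 3],
    [9, 8, 6, 5],
    [5, 6, 3, 2],
    [8, 9, 7, 6],
    [4, 3, 2, 1],
]
alpha = variance_form(scores)

# The same form evaluated from the Table 1 summary constants.
table1 = 5 / 4 * (1 - 41.43 / 140.05)
print(f"{table1:.3f}")  # 0.880
```

Only the n test-sample variances and the variance of the total score are needed, which is why this form involves the least computing.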

The Part-Whole Form ("Gulliksen's formula"). For purposes of item analysis, one may be interested in the relation between each test-sample and the total composite score. He would calculate the correlation, $r_{it}$, between each test-sample, $X_i$, and the total score, $X_t$. These correlations may then be used directly to compute the reliability, thus:

$$r_{tt} = \frac{n}{n-1}\left[1 - \frac{\sum V_i}{\left(\sum \sigma_i r_{it}\right)^2}\right] \qquad [13]$$

(Part-Whole Form: Reliability from r's between test-samples and $X_t$).

Table 1 under "Part-Whole Form" gives the requisite desk calculator operations. Note that the value of $r_{tt}$ is exactly .880. The several values of $\sigma_i r_{it}$ are of interest since they reveal the relative contribution of the different test-samples to the reliability of the composite.

Gulliksen derived the Part-Whole Form under restrictive assumptions for the special case of dichotomous items (7, p. 378), but you can see that the form is general because the Part-Whole Form merely substitutes for $V_t$ in Variance Form [12] the equivalent expression, $(\sum \sigma_i r_{it})^2$. For it can be shown that for any test-sample, $X_i$, its part-whole correlation with $X_t$ is

$$r_{it} = \frac{V_i + \sum_j \sigma_i\sigma_j r_{ij}}{\sigma_i\sigma_t} \quad (i \neq j),$$

$$\sigma_i r_{it}\,\sigma_t = V_i + \sum_j \sigma_i\sigma_j r_{ij} \quad (i \neq j).$$

If we sum the n such terms for all $X_i$s, the total is $V_t$ by [10], as follows:

$$\sum \sigma_i r_{it}\,\sigma_t = \sum V_i + \sum\sum \sigma_i\sigma_j r_{ij} = V_t,$$

$$\sum \sigma_i r_{it} = \sigma_t,$$

$$\left(\sum \sigma_i r_{it}\right)^2 = V_t.$$

The Part-Whole Form is useful as a check. In order to compute the sundry $r_{it}$ values, one must perforce calculate both $V_i$ and $V_t$. Since these values are, however, the only ones needed in the simpler Variance Form [12], the extra labor of working out the $r_{it}$ correlations is unnecessary.
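
The identity behind the Part-Whole Form, that the sum of the sigma-weighted part-whole correlations equals the standard deviation of the total, can be checked numerically. The sketch below uses a hypothetical score matrix and illustrative helper names; it computes each sigma_i * r_it as the covariance of the test-sample with the total divided by the total's standard deviation, which is algebraically the same quantity.

```python
from math import sqrt
from statistics import pvariance


def pcov(x, y):
    """Covariance with divisor N, matching statistics.pvariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def part_whole_form(scores):
    """Part-Whole Form [13]; rows are individuals, columns are test-samples."""
    cols = list(zip(*scores))
    n = len(cols)
    totals = [sum(row) for row in scores]
    sd_t = sqrt(pvariance(totals))
    # sigma_i * r_it equals cov(X_i, X_t) / sigma_t.
    sigma_r = [pcov(col, totals) / sd_t for col in cols]
    sum_item_var = sum(pvariance(col) for col in cols)
    return n / (n - 1) * (1 - sum_item_var / sum(sigma_r) ** 2), sigma_r


# Hypothetical score matrix, not the data of Table 1.
scores = [
    [6, 2, 1, 0],
    [7, 5, 4, 3],
    [9, 8, 6, 5],
    [5, 6, 3, 2],
    [8, 9, 7, 6],
    [4, 3, 2, 1],
]
r_tt, sigma_r = part_whole_form(scores)

# Same form from the Table 1 constants (sum V_i = 41.43, sum sigma_i r_it = 11.834).
table1 = 5 / 4 * (1 - 41.43 / 11.834 ** 2)
print(f"{table1:.3f}")  # 0.880
```

The list `sigma_r` is the part worth inspecting in an item analysis, since each entry shows how much one test-sample contributes to the standard deviation of the composite.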

The Individual Variance Form. If the analyst wishes to study each individual's variance, $V_x$, of its n test-sample scores, he can calculate $r_{tt}$ from these individual variances. The formula is

$$r_{tt} = 1 - \frac{\dfrac{1}{n-1}\left(n^2\bar{V}_x + M_t^2 - n\sum M_i^2\right)}{V_t} \qquad [14]$$

(Individual Variance Form: Reliability from individual variances of test-sample scores),

where $\bar{V}_x$ is the mean of the N individual variances.

At the right in Table 1 you see under "Individual Variance Form" that the value of $r_{tt}$ from [14] is also exactly .880. The derivation of [14] from the Variance Form is a little cumbersome, and has been placed in the appendix. We shall see later that the numerator term in [14] is also the value of $V_{e_t}$, which is the individual variance of the total $X_t$ composite score (see 24a). For another approach to this problem, see Horst (10).
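
The following sketch evaluates the Individual Variance Form [14] two ways: from a hypothetical score matrix, and from the Table 1 constants (mean individual variance 4.768, squared mean of totals 600.25, sum of squared test-sample means 130.47). Each individual's variance is taken over that individual's n scores with divisor n; the function name is illustrative.

```python
from statistics import pvariance


def individual_variance_form(scores):
    """Form [14]; rows are individuals, columns are test-samples."""
    n = len(scores[0])
    totals = [sum(row) for row in scores]
    # Mean over individuals of the variance of each individual's n scores.
    mean_ind_var = sum(pvariance(row) for row in scores) / len(scores)
    m_t = sum(totals) / len(totals)
    sum_m2 = sum((sum(col) / len(col)) ** 2 for col in zip(*scores))
    numerator = (n ** 2 * mean_ind_var + m_t ** 2 - n * sum_m2) / (n - 1)
    return 1 - numerator / pvariance(totals)


# Hypothetical score matrix, not the data of Table 1.
scores = [
    [6, 2, 1, 0],
    [7, 5, 4, 3],
    [9, 8, 6, 5],
    [5, 6, 3, 2],
    [8, 9, 7, 6],
    [4, 3, 2, 1],
]
r_tt = individual_variance_form(scores)

# Same form from the Table 1 constants.
numerator = (25 * 4.768 + 600.25 - 5 * 130.47) / 4
table1 = 1 - numerator / 140.05
print(f"{numerator:.1f}")  # 16.8
print(f"{table1:.3f}")     # 0.880
```

The numerator printed first is the quantity the text identifies with the individual variance of the total score.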

The Covariance Form (and the Spearman-Brown approximation). For a cluster or factor analysis of the test-samples the analyst may compute all the intercorrelations between the test-samples. If so, he can calculate $r_{tt}$ from these rs. He would write these rs as the first entries in the variance-covariance matrix (or pooling square), as illustrated under "Covariance (and its Approximation) Form" at the bottom of Table 1. If he then multiplies each r by its $\sigma_i\sigma_j$ for the cell, he enters the covariance, $c_{ij}$, as the second entry. The mean of these covariances, $\bar{c}_{ij}$, may now be used to calculate $r_{tt}$ by the Covariance Form, which is

$$r_{tt} = \frac{n\bar{c}_{ij}}{\bar{V}_i + (n-1)\bar{c}_{ij}} \qquad [15]$$

(Covariance Form: Reliability from covariances between test-samples).

Notice that the Covariance Form is simply the General Form [11] in which we have substituted the equivalent of $V_t$ from [10] and cancelled n.

In Table 1, the essential summations for $\bar{V}_i$ and for $\bar{c}_{ij}$ are given below the variance-covariance matrix. The Covariance Form at the right gives $r_{tt}$ = .880, as before.

To the writer's knowledge the Covariance Form has not been published as a computing form of $r_{tt}$. But it is one. Probably it has been ignored because it leads directly to the approximation familiarly known as the Spearman-Brown formula.
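
A short sketch of the Covariance Form [15], again on a hypothetical score matrix and with the Table 1 constants (mean variance 8.286, mean covariance 4.931). Helper names are illustrative, and covariances use divisor N.

```python
from statistics import pvariance


def pcov(x, y):
    """Covariance with divisor N, matching statistics.pvariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def covariance_form(scores):
    """Covariance Form [15]; rows are individuals, columns are test-samples."""
    cols = list(zip(*scores))
    n = len(cols)
    mean_var = sum(pvariance(c) for c in cols) / n
    # Mean of the n(n-1) off-diagonal covariances.
    covs = [pcov(cols[i], cols[j]) for i in range(n) for j in range(n) if i != j]
    mean_cov = sum(covs) / len(covs)
    return n * mean_cov / (mean_var + (n - 1) * mean_cov)


# Hypothetical score matrix, not the data of Table 1.
scores = [
    [6, 2, 1, 0],
    [7, 5, 4, 3],
    [9, 8, 6, 5],
    [5, 6, 3, 2],
    [8, 9, 7, 6],
    [4, 3, 2, 1],
]
r_tt = covariance_form(scores)

# Same form from the Table 1 constants.
table1 = 5 * 4.931 / (8.286 + 4 * 4.931)
print(f"{table1:.3f}")  # 0.880
```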

Covariance Approximation Form, or the Spearman-Brown Formula. We can get rid of the covariances in [15] by taking the product of $\bar{V}_i$ and $\bar{r}_{ij}$ as an approximation to the mean covariance, $\bar{c}_{ij}$, i.e.,

$$\bar{c}_{ij} = \overline{\sigma_i\sigma_j r_{ij}} \approx \bar{V}_i\,\bar{r}_{ij}. \qquad [16]$$

Substituting [16] in the Covariance Form [15] gives the Spearman-Brown (S-B) formula,

$$r_{tt} \approx \frac{n\bar{r}_{ij}}{1 + (n-1)\bar{r}_{ij}} \qquad [17]$$

(Covariance Approximation Form: the Spearman-Brown formula).

In Table 1 the evaluation of the S-B formula does not give the exact value of .880; because of the approximation in [16], it takes the value .869.

As usually written in texts, the S-B formula is shown with r not as a mean but as a single value on the equivalence assumption [6]. The above development shows that the mean $\bar{r}_{ij}$ between the test-samples is called for, and that no assumption of equivalent inter-rs is necessary.
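
To see the size of the approximation, the sketch below evaluates the Spearman-Brown form [17] with the mean intercorrelation of Table 1 (.5695) and compares it with the exact Covariance Form [15] on the same table's constants. The function name is illustrative.

```python
def spearman_brown(n: int, mean_r: float) -> float:
    """Spearman-Brown form [17] with the mean intercorrelation of the test-samples."""
    return n * mean_r / (1 + (n - 1) * mean_r)


approx = spearman_brown(5, 0.5695)            # approximation [17]
exact = 5 * 4.931 / (8.286 + 4 * 4.931)       # Covariance Form [15]
print(f"{approx:.3f}")  # 0.869
print(f"{exact:.3f}")   # 0.880
```

The gap of about .01 comes entirely from replacing the mean covariance by the product of the mean variance and the mean correlation in [16].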

The special fame of the S-B formula stems from the belief that it saves the analyst work in computing $r_{tt}$. Before assessing this claim, let us recall that the Variance Form [12] requires only the computing of variances.

The orthodox use of the S-B formula is with the split-half method. Here, the test-samples are divided into two halves, $X_{h_1}$ and $X_{h_2}$. The total score is thus expressed:

$$X_t = X_{h_1} + X_{h_2}.$$

The Covariance Form [15] and the S-B Form [17] become in this case, respectively,

$$r_{tt} = \frac{2c_{h_1h_2}}{\bar{V}_h + c_{h_1h_2}} \qquad [18a]$$

$$r_{tt} \approx \frac{2r_{h_1h_2}}{1 + r_{h_1h_2}} \qquad [18b]$$

This split-half procedure requires extra work of the analyst, who must compute a second score matrix consisting of the two split-half scores of the individuals, and then must compute the correlation between these scores. As will be shown later, when the split is odd vs. even test-samples, the use of [18a] or [18b] is desirable when the order of the test-samples affects their statistical constants.

At best, the answer by the S-B formula is an approximation which should be avoided by the use of the Covariance Form [18a]. For this case of unstratified composites there is the further trouble that the coefficient from [18a] or [18b] is based on only one splitting, and there are of course many arbitrary ways to split-half the composite (random halves, item-parallel halves, odd-even halves), all yielding different values of $r_{tt}$ by [18a] or [18b]. However, the mean of all these possible split-half rs turns out to be exactly the value given by our Variance Form [12], according to Cronbach (4).
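
The statement attributed to Cronbach can be checked by brute force on a small example. The sketch below enumerates every split of a hypothetical 4-test-sample matrix into two halves of equal size, applies the covariance form [18a] to each split, and compares the mean with the Variance Form [12]. It assumes an even number of test-samples and equal-sized halves; all names are illustrative.

```python
from itertools import combinations
from statistics import pvariance


def pcov(x, y):
    """Covariance with divisor N, matching statistics.pvariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def variance_form(scores):
    n = len(scores[0])
    item_vars = sum(pvariance(col) for col in zip(*scores))
    return n / (n - 1) * (1 - item_vars / pvariance([sum(r) for r in scores]))


def split_half_18a(scores, half_a):
    """Form [18a] for one split; half_a holds the column indices of one half."""
    n = len(scores[0])
    half_b = [j for j in range(n) if j not in half_a]
    xa = [sum(row[j] for j in half_a) for row in scores]
    xb = [sum(row[j] for j in half_b) for row in scores]
    c = pcov(xa, xb)
    mean_half_var = (pvariance(xa) + pvariance(xb)) / 2
    return 2 * c / (mean_half_var + c)


def mean_of_all_splits(scores):
    n = len(scores[0])  # assumed even
    # Keep column 0 in the first half so that each split is counted once.
    splits = [(0,) + rest for rest in combinations(range(1, n), n // 2 - 1)]
    return sum(split_half_18a(scores, s) for s in splits) / len(splits)


# Hypothetical score matrix, not the data of Table 1.
scores = [
    [6, 2, 1, 0],
    [7, 5, 4, 3],
    [9, 8, 6, 5],
    [5, 6, 3, 2],
    [8, 9, 7, 6],
    [4, 3, 2, 1],
]
print(abs(mean_of_all_splits(scores) - variance_form(scores)) < 1e-9)  # True
```

Individual splits differ from one another, but their mean agrees with the Variance Form, so a single arbitrary split adds sampling noise without adding information.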

Behavior domain validity. A statistic that is more meaningful than the reliability coefficient is the correlation of $X_t$ with a score on a domain of composites comparable to $X_t$. A domain score of an individual would be the best criterion of his status in the property, X, as operationally defined. The domain score, usually called a "true" score, is defined as the sum (or average) of scores on a large number of composites, $X_t'$, $X_t''$, etc., each of which has the same average statistical properties defined in [7], [8], and [9], that is,

$$X_{t_\infty} = X_t + X_t' + X_t'' + \cdots + X_t^{(n_\infty)}, \qquad [19]$$

where $n_\infty$ is such a large number that

$$1/n_\infty \text{ approaches } 0, \text{ and } (n_\infty - 1)/n_\infty \text{ approaches } 1. \qquad [20]$$

Such a domain of composite scores is of course a theoretical construct, but an important one, for if the analyst discovers that the correlation, $r_{tt_\infty}$, of the observed composite $X_t$ with such a hypothetical criterion is very high he knows that the individuals are actually ranked in observed scores close to their ranking in a perfectly reliable measure of the property X, as operationally defined. If $r_{tt_\infty}$ is low, he knows he must improve his observed sampling of X.

From the correlation of sums, and recalling the defined statistical properties of the comparable composites, $X_t'$, $X_t''$, ..., as given in [7], [8], and [9], then

$$r_{tt_\infty} = \frac{(1/N)\sum x_t\,(x_t + x_t' + \cdots + x_t^{(n_\infty)})}{\sigma_t\,\sigma_{t_\infty}} = \frac{\sigma_t^2 + (n_\infty - 1)\,\sigma_t^2\, r_{tt}}{\sigma_t\sqrt{n_\infty\sigma_t^2 + n_\infty(n_\infty - 1)\,\sigma_t^2\, r_{tt}}}.$$

Cancelling $\sigma_t^2$, dividing numerator and denominator by $n_\infty$, and noting the limits of [20], we get

$$r_{tt_\infty} = \sqrt{r_{tt}} \qquad [21]$$

(Behavior domain validity of $X_t$).

This correlation $r_{tt_\infty}$ has been labelled in textbooks as the "index of reliability," a meaningless term. It is obviously the behavior domain validity of $X_t$, because it is the correlation between a sample and its perfect criterion measure of the property X as operationally defined. For the data of Table 1, we find that the domain validity of the $X_t$ scores is $\sqrt{.880}$, or the high value of .94.
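
The limiting step that leads to [21] can be illustrated numerically. The sketch below evaluates the correlation of the observed composite with a sum of a finite number of comparable composites (the observed one included) and shows it settling on the square root of the reliability as that number grows. The function names and the chosen domain sizes are illustrative assumptions.

```python
from math import sqrt


def corr_with_domain(r_tt: float, n_inf: int) -> float:
    """Correlation of X_t with the sum of n_inf comparable composites, X_t included."""
    return (1 + (n_inf - 1) * r_tt) / sqrt(n_inf + n_inf * (n_inf - 1) * r_tt)


def domain_validity(r_tt: float) -> float:
    """Limit [21]: the square root of the reliability coefficient."""
    return sqrt(r_tt)


for size in (1, 10, 1000, 10 ** 6):
    print(size, f"{corr_with_domain(0.880, size):.4f}")

print(f"{domain_validity(0.880):.3f}")  # 0.938
```

With a single composite the correlation is 1 (the composite with itself), and it falls toward the domain validity as more comparable composites are pooled into the criterion.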

Individual variance or "error variance." A very practical experimental matter is the discovery of the degree to which an individual's rank in the group of observed $X_t$ scores would probably deviate from his ranking in the true $X_{t_\infty}$ scores. Thus in Table 1 we would ask: with what confidence can we believe that individual No. 1, the low scorer with an $X_t$ = 9, would still be lowest in domain score?

We do not possess the domain score, of course, but we can nevertheless estimate a probable range of an individual's observed score around it, at a chosen level of confidence. To do so, let us express the domain score on the scale of the observed scores by writing it as the average of the individual's composite scores in [19], i.e.,

$$\bar{X}_{t_\infty} = (1/n_\infty)(X_t + X_t' + \cdots + X_t^{(n_\infty)}). \qquad [22]$$

The deviation of an individual's observed $X_t$ score from his domain score, $\bar{X}_{t_\infty}$, is commonly called his "error of measurement" in $X_t$. This is a bad term because the experimenter usually has no objective grounds for establishing that the fluctuations of an individual's observed performances are "errors"; in fact, they usually are genuine variations that simply deviate from an average "true" parameter value, $\bar{X}_{t_\infty}$, of the individual.

The meaning and importance of assessing this deviation, desirable in all psychological testing, is particularly evident in the case of behavior genetic experiments. How different must the scores of two individuals be in order to be sure they are reliably different? In selective breeding for a pure line one would breed together only those extreme individuals whose scores are not reliably different. The experimenter would also cease selective breeding in a "pure" line whose variance in observed scores equalled the individual or "error" variance.

This individual variance, called $V_{e_t}$, is quite simply assessed. It is the variance in observed $X_t$ scores of a subgroup of individuals that are identical in their domain $X_{t_\infty}$ scores. Since we know the correlation between $X_t$ and $X_{t_\infty}$ from [21], we compute $V_{e_t}$ from the orthodox formula for the standard error of estimate of $X_t$ scores from $X_{t_\infty}$ scores, i.e.,

$$V_{e_t} = V_t(1 - r_{tt_\infty}^2) = V_t(1 - r_{tt}), \qquad [23a]$$

whence, $\sigma_{e_t}$, the individual standard deviation, is

$$\sigma_{e_t} = \sigma_t\sqrt{1 - r_{tt}}. \qquad [23b]$$

One does not assume that all individuals have the same value of the individual variance. The value by [23a] is the average of the N individual variances, being the mean squared deviation around the theoretical regression line of $X_t$ on $X_{t_\infty}$.

To illustrate from Table 1, let us understand the $X_t$ scores there to be the sum of maze errors of rats in five trials (as indeed these scores actually are). The variance between individuals in these scores is $V_t$ = 140. Were we to selectively breed together over many generations rats with the lowest scores, like individual No. 1, we would cease selection when the line had a variance of $V_{e_t}$ = 140(1 - .88) = 17 errors. Further selection would be fruitless since all observed variation between rats in the line would be variance within individual rats and not between them.
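
The arithmetic of [23a] and [23b] for the Table 1 data can be sketched as follows. The function name is illustrative; the inputs are the composite variance (140.05) and reliability (.880) reported in Table 1.

```python
from math import sqrt


def individual_variance(total_var: float, r_tt: float) -> float:
    """Expression [23a]: V_t * (1 - r_tt)."""
    return total_var * (1 - r_tt)


v_e = individual_variance(140.05, 0.880)
sd_e = sqrt(v_e)                      # expression [23b]
print(f"{v_e:.1f}")   # 16.8
print(f"{sd_e:.1f}")  # 4.1
```

The text rounds the first figure to 17; the second is the individual standard deviation on the scale of the total maze-error score.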

A complementary "basic definition" of the reliability. We can express rather meaningfully the reliability coefficient, $r_{tt}$, as a function of the within-individual variance, $V_{e_t}$. From [23a] we can write $r_{tt}$ as

$$r_{tt} = 1 - \frac{V_{e_t}}{V_t} \qquad [24a]$$

(Individual Variance Ratio Form: Reliability from $V_{e_t}$ of total score, $X_t$).

To illustrate from the data of Table 1, the reliability becomes $r_{tt}$ = 1 - 17/140 = 1 - .12 = .88, as in the other computing forms.

The individual variance ratio, $V_{e_t}/V_t$ in [24a], is thus the proportion of the total variance, $V_t$, among all individuals that is determined by the individual variance, $V_{e_t}$. Here the ratio is 12%. The complement of this proportion is the value of the reliability coefficient itself, which may be written

$$r_{tt} = \frac{V_t - V_{e_t}}{V_t}. \qquad [24b]$$

Thus the reliability coefficient is that remaining proportion of the total variance which would be the variance of the domain scores of the individuals.

Expressions [24a] and [24b] have been sometimes referred to as the "basic definition" of the reliability coefficient because of the meaningfulness of the proportional terms. You will note that the Individual Variance Ratio Form in [24a] is a parallel identity with the Individual Variance Form [14]. Thus the value of the individual variance, $V_{e_t}$, of the total $X_t$ score (see [23a]) can be computed directly from the mean individual variance, $\bar{V}_x$, across the test-samples, via the numerator term of [14].


STRATIFIED COMPOSITES AND DOMAINS
Item-Parallel, Split-Half, Test-Retest and Battery Reliability


In the foregoing treatment we have been examining the unstratified case in which the n test-samples, $X_1, X_2, \ldots, X_n$, are drawn with equal likelihood, or without bias, from a potential pool of such test-samples. Situations often arise, however, in which the analyst conceptualizes the property X he seeks to measure as definitely made up of substrata of properties. For example, where X is vocabulary knowledge, he may conceptualize it as made up of strata of different levels of difficulty, or of different kinds of content. If X is a personality attribute, the sample judges may come from strata pertaining to degree of acquaintance with the subject. Or the test-samples may be associated with different stages of learning or of test-adaptation of the subjects.

Composites of n strata. "Item-parallel tests." A special case of a composite drawn from a stratified domain is the one in which each of the n items is expressly drawn from one of n defined strata. To understand this case, return to the definitions of the $X_t$ composite in expression [1] and of its comparable construct, $X_t'$, in equation [2]. In this case, $X_1$ is parallel to $X_1'$, i.e., they are drawn from a defined stratum, like two true-false items of the same kind of content. Similarly, $X_2$ is parallel to $X_2'$, and $X_3$ is parallel to $X_3'$, and so on.

In our earlier definition of the comparable construct, $X_t'$, the fourth condition referred to the cross covariances between the test-samples of $X_t$ and its $X_t'$ construct. When, as in this case, the n test-samples fall into n strata, then the mean cross-covariance term of $r_{tt}$ by [11] is the weighted mean of the following two types of mean cross covariances:

$\bar{c}_{ii'}$, the mean cross covariance between the n pairs of parallel test-samples, an unknown value.

$\bar{c}_{ij'}$, the mean cross covariance between the nonparallel items, $n^2 - n$ in number, where $i \neq j'$, and by definition equal to $\bar{c}_{ij}$.


We therefore rewrite the General Form [11] of $r_{tt}$ for unstratified composites in the following form for stratified composites:

$$r_{tt} = \frac{n\bar{c}_{ii'} + (n^2 - n)\bar{c}_{ij}}{V_t} \quad (i \neq j) \qquad [25]$$

(Reliability of a composite with n strata of test-samples).

To see this form more simply let us take the approximation of [16], substitute the equivalent of $V_t$ from [10], cancel n and $\bar{V}_i$, whence

$$r_{tt} \approx \frac{\bar{r}_{ii'} + (n-1)\bar{r}_{ij}}{1 + (n-1)\bar{r}_{ij}} \quad (i \neq j) \qquad [26]$$

(Approximation to 25).

The reliability coefficient of a stratified composite by [26] is the analogue of that of an unstratified composite by the familiar S-B formula. The mean correlation, $\bar{r}_{ii'}$, between parallel test-samples will usually be higher than that between nonparallel, i.e.,

$$\bar{r}_{ii'} > \bar{r}_{ij}, \qquad [27]$$

whence the reliability of composites of n strata will usually be higher than that for unstratified composites. The lowest value that $\bar{r}_{ii'}$ could reasonably take in any composite would be $\bar{r}_{ii'} = \bar{r}_{ij}$. Substitution of this lowest limit in [25] and [26] reduces them, respectively, to the General Form [11] and the Spearman-Brown formula [17] for unstratified samples. Therefore, when one is dealing with a stratified composite the lower limit of its reliability is that found by any of the earlier computing forms for an unstratified composite.
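
The lower-limit argument can be sketched with the approximation [26]. In the example below, the nonparallel mean correlation is the Table 1 value (.5695), while the parallel mean correlation of .80 is purely hypothetical, chosen only to show the direction of the effect. Function names are illustrative.

```python
def stratified_approx(n: int, mean_r_parallel: float, mean_r_nonparallel: float) -> float:
    """Approximation [26] for a composite of n strata."""
    return (mean_r_parallel + (n - 1) * mean_r_nonparallel) / (
        1 + (n - 1) * mean_r_nonparallel
    )


def spearman_brown(n: int, mean_r: float) -> float:
    """Spearman-Brown form [17]."""
    return n * mean_r / (1 + (n - 1) * mean_r)


lower_limit = stratified_approx(5, 0.5695, 0.5695)   # parallel r at its lowest value
with_strata = stratified_approx(5, 0.80, 0.5695)     # hypothetical parallel r of .80
print(f"{lower_limit:.3f}")  # 0.869
print(with_strata > lower_limit)  # True
```

Setting the parallel correlation equal to the nonparallel one reproduces the Spearman-Brown value, which is why the unstratified forms serve as a lower limit.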

The troublesome feature of an observed composite made up of n strata is that its reliability by [25] or [26] is indeterminate without a second set of n parallel test-samples from which $\bar{r}_{ii'}$ can be calculated. Without such a comparable set of test-samples one must usually be content with finding the lower limit of $r_{tt}$ from [11].

Odd-even split-halves. When the n strata of $X_t$ represent test-samples that are a serial order of responses of the subjects, usually the case with psychological tests and learning tests, a solution of $r_{tt}$ is available by the odd-even split-half method. By this procedure one scores each subject on two subcomposites as follows:

$$X_{h_o} = X_1 + X_3 + X_5 + \cdots$$
and so on for all odd test-samples,

$$X_{h_e} = X_2 + X_4 + X_6 + \cdots$$
and so on for all even test-samples.

Here one has two comparable composites which would satisfy the four defined conditions of comparability including the approximate matching of test-samples for serial order. The correlation, $r_{h_oh_e}$, is thus by definition the reliability of a composite with one-half the number of test-samples but with the same serial stratification of the total composite, $X_t$. Substituting $r_{h_oh_e}$ into [18a] or [18b] gives the desired value of $r_{tt}$.

This common method of determining the reliability of a composite is the correct procedure to follow in all those testing situations where the test-samples do have a serial order. It should give values for reliabilities equal to, but usually higher than, that computed by [11] for unstratified composites, a fact commonly observed (e.g., 12, Ch. 4).

Replicated composites ("Test-retest reliability"). The analyst who wishes to observe the constancy of individual differences over time may administer the same composite, $X_t$, on several occasions. In this special case, the actual replications, $X_t'$, $X_t''$, are duplicate parallel composites, test-sample for test-sample, rather than matched test-samples, as just above.

Before actually performing these replications, the analyst can estimate the lower and upper limits of the correlations between the observed $X_t$ composite and the proposed replications of it. These limits can be estimated from the constants of one administration of $X_t$ alone. At the lower limit the covariance, $\bar{c}_{ii'}$, between duplicate test-samples on replication would probably not be lower than the covariance, $\bar{c}_{ij}$, between nonparallel test-samples on the first occasion (if the proposed time interval is reasonably short), hence for $\bar{c}_{ii'} = \bar{c}_{ij}$ in [25], we find the probable lower limit of the test-retest correlation to be that between unstratified composites by [11]. There is, however, no prior knowledge of the effect of elapsed time on the property, X. It could easily be that between the first administration of $X_t$ and its second, $X_t'$, the cross covariances, $\bar{c}_{ii'}$ and $\bar{c}_{ij'}$, may be much less than $\bar{c}_{ij}$, and in the limit it is possible that the test-retest coefficient could go to zero, or negative.

At the upper limit, the correlation between replicated test-samples could be 1.00, and their variances could remain constant over time. In this case, you will note that in [25], and more obviously in [26], the numerators would equal the denominators, whence the upper limit of the test-retest coefficient would be 1.00. Guttman (8) has studied these limits in some detail, basing them, however, on certain restrictive assumptions not required in the above analysis.

Composite of test-samples with variable N ("Speed tests"). A "speed test" with a restricted time limit may reveal an increasing proportion of the N subjects falling into the class, "No response," on the n successive test-samples, $X_1, X_2, \ldots, X_n$; similarly, with a "power test" having test-samples of increasing difficulty. The reliability, $r_{tt}$, by [26] becomes indeterminate in this case because it is not possible to compute either $\bar{r}_{ii'}$ or $\bar{r}_{ij}$, even if one had at hand an actual comparable composite with parallel test-samples $X_1', X_2', \ldots, X_n'$. The reason is that "No response" does not fall in the same continuum with the quantitative scores of those who do respond. Our formulations only apply to that fraction of all N subjects who respond to all n test-samples, usually a subgroup of restricted range.

A special experimental design is required of a test if it is to satisfy the definition of a speed test: all subjects should respond to each of the n test-samples, the measure taken on them being elapsed time. Under such a design, reliability or its limits becomes determinable by the formulations presented here. With a graded power test our formulations also apply provided the class "No response" can be legitimately assigned the lowest score possible on the test-sample, usually zero.

Subdomains (Battery reliability). The more general case of a stratified domain, and one which does provide an exact solution of the reliability coefficient from the constants of the observed composite, is the one in which the strata represent defined subdomains of test-samples. Common examples would be a vocabulary test with one stratum being a block of items at one difficulty level, a second stratum of items at another level, and so on. Test "batteries," such as the Wechsler-Bellevue, are of this form. The reliability coefficient of the grand total score, $X_t$, is calculated either from odd-even split-halves or, as shown below, from constants of the blocks of observed test-samples.

In this general case, the composite scores of $X_t$ and of another comparable construct composite, $X_t'$, as defined by [1] and [2] may be written:

$$X_t = \sum X_g + \sum X_h + \cdots + \sum X_k$$
$$X_t' = \sum X_g' + \sum X_h' + \cdots + \sum X_k' \qquad [28]$$

where the parallelism is of the subdomains, g, h, ..., k, with parallel blocks of test-samples $\sum X_g$ and $\sum X_g'$, $\sum X_h$ and $\sum X_h'$, and so on, there being k such parallel blocks of test-samples. By our definition of comparability, there are $n_g$ test-samples in $\sum X_g$, $n_h$ in $\sum X_h$, and so on. For parallel blocks, equal mean variances and covariances of conditions of [8] and [9] hold by definition.

From the correlation of sums, the reliability of $X_t$, which is the correlation between $X_t$ and $X_t'$ of [28], becomes, after a little algebra,

$$r_{tt} = \frac{\sum r_{t_gt_g}V_{t_g} + 2\sum c_{t_gt_h}}{V_t} \qquad [29]$$

(Covariance Form of the reliability of a stratified composite),

where $r_{t_gt_g}$ is the reliability of the total score on each of the k sets of $X_g$ block of test-samples, simply computed by the Variance Form [12]; $V_{t_g}$ is the computed variance of the total scores on the $X_g$ block; $\sum c_{t_gt_h}$ is the sum of covariances between total scores on different $X_g$ and $X_h$ ($g \neq h$) blocks; $V_t$ is the variance of the grand total scores on the stratified composite, $X_t$, calculated from the variance of a sum, i.e.,

$$V_t = \sum V_{t_g} + 2\sum c_{t_gt_h}. \qquad [30]$$

A simpler computing form results from solving for $\sum c_{t_gt_h}$ in [30], and substituting its equivalent in [29], whence we get the identity

$$r_{tt} = 1 - \frac{\sum V_{t_g} - \sum V_{t_g}r_{t_gt_g}}{V_t} \qquad [31]$$

(Variance Form of the reliability of a stratified composite).

For the relatively common situation in which the total scores on the different strata are in different units, the analyst would convert them to sigma scores, with $V_{t_g} = 1$, whence [31] reduces to

$$r_{tt} = 1 - \frac{n_s - \sum r_{t_gt_g}}{V_t'} \qquad [32]$$

(Variance Form of 31 for equally weighted strata),

where $n_s$ is the number of subdomains or strata, and $V_t'$ is the variance of the equally weighted total score.

There are no unknowns in this situation, and hence an exact value of $r_{tt}$ for such a stratified composite can be calculated. In Formula 29 or 31 we have a completely general formula for a determinable reliability coefficient, $r_{tt}$. For the special case of a stratified composite made up of n parallel test-samples, treated in the preceding two sections, Formula 29 reduces to Formula 25. For the case of unstratified domains, the numerator of [29] reduces to $n^2\bar{c}_{ij}$ which leads to our General Form [11], whose approximation is the orthodox Spearman-Brown formula [17].
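
A brief sketch of the stratified forms follows, showing that the Covariance Form [29] and the Variance Form [31] agree once the total variance is built from [30], and that [32] is the unit-variance case. All block variances, block reliabilities, and covariances below are hypothetical numbers, and the function names are assumptions made for illustration.

```python
def total_variance(block_vars, sum_block_covs):
    """Expression [30]; sum_block_covs sums covariances over distinct pairs of blocks."""
    return sum(block_vars) + 2 * sum_block_covs


def stratified_covariance_form(block_vars, block_rels, sum_block_covs):
    """Covariance Form [29] for a stratified composite."""
    v_t = total_variance(block_vars, sum_block_covs)
    numerator = sum(r * v for r, v in zip(block_rels, block_vars)) + 2 * sum_block_covs
    return numerator / v_t


def stratified_variance_form(block_vars, block_rels, v_t):
    """Variance Form [31] for a stratified composite."""
    unreliable = sum(v * (1 - r) for r, v in zip(block_rels, block_vars))
    return 1 - unreliable / v_t


# Hypothetical battery of three blocks.
block_vars = [20.0, 35.0, 12.0]
block_rels = [0.80, 0.90, 0.70]
sum_block_covs = 30.0

v_t = total_variance(block_vars, sum_block_covs)
by_29 = stratified_covariance_form(block_vars, block_rels, sum_block_covs)
by_31 = stratified_variance_form(block_vars, block_rels, v_t)
print(abs(by_29 - by_31) < 1e-12)  # True

# Form [32]: the same blocks expressed in sigma scores (unit variances),
# with a hypothetical total variance of the equally weighted score.
v_t_equal = 6.0
by_32 = 1 - (len(block_rels) - sum(block_rels)) / v_t_equal
by_31_unit = stratified_variance_form([1.0, 1.0, 1.0], block_rels, v_t_equal)
```

Only block-level quantities enter these forms, so the reliability of a battery can be assembled from the reliabilities already computed for its parts.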


SIMPLIFIED FORMULATION FOR COMPOSITES OF
DICHOTOMOUS TEST-SAMPLES (ITEMS)


Part of the confusion about the dif- 
ferent forms of computing the reli- 
ability of a composite is due to the 
fact that, since psychologists have 
focused on mental tests or question- 
naires composed of dichotomous 
items, like Yes-No questions, some of 
the computing forms for reliability 
have been publicized by their special 
cases for this situation, like the K-R 
formula, while others are known by 
their general case for continuous test- 
samples, like the Variance Form 
(Alpha) or the S-B formula. 

When the test-samples are dichotomous, the score matrix has 1's and 0's in it instead of continuous values as in Table 1. In such a special case, $V_i = p_i q_i$, where $p_i$ is the proportion of Yes's or 1's in test-sample $X_i$, and $q_i = 1 - p_i$. For this special case we therefore insert $\sum p_i q_i$ for $\sum V_i$ in Variance Form [12]. This computational variant of the Variance Form has been labelled the K-R Formula 20 after its authors (16). This appellation is undesirable not only because the K-R form is simply a minor situational variant of the Variance Form but primarily because its derivers assumed that the test-samples are determined by one general factor and should have equal variances and inter-rs. The direct derivation of [12] from [11] demonstrates that these “assumptions” are, in fact, unnecessary restrictions.
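The equivalence just described can be checked numerically. The sketch below, on a small made-up 0/1 score matrix, computes the Variance Form with item variances and again with $\sum p_i q_i$ substituted; the matrix and the function names `variance_form` and `kr20` are illustrative assumptions, and variances are taken in the divide-by-N form so that $V_i = p_i q_i$ holds exactly.

```python
import numpy as np

# Hypothetical score matrix: 6 persons (rows) by 4 dichotomous test-samples
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])

def variance_form(x):
    """Variance Form [12]: n/(n-1) * (1 - sum V_i / V_t)."""
    n = x.shape[1]
    return n / (n - 1) * (1 - x.var(axis=0).sum() / x.sum(axis=1).var())

def kr20(x):
    """Same form with sum p_i q_i inserted for sum V_i."""
    n = x.shape[1]
    p = x.mean(axis=0)          # proportion of 1's per test-sample
    q = 1 - p
    return n / (n - 1) * (1 - np.sum(p * q) / x.sum(axis=1).var())

print(variance_form(scores), kr20(scores))
```

Both calls return the same number, since for 0/1 scores the item variance is $p_i q_i$ by definition.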

Similarly, the Part-Whole Form [13] was derived by Gulliksen for the purely Yes-No test-samples. Here, one needs the Pearson r of a dichotomous item with total score, a continuous variable. The computational form of r simplifies in this case to the point-biserial r, but the Part-Whole Form [13] is in no way restricted to point-biserials, as our derivation shows.

In like fashion, an analyst using 
the Covariance Form [15] or its S-B 
approximation [16] which require the 
inter-rs between test-samples, would 
compute these rs from the phi coefficient,
the product-moment r between
dichotomous test-samples.

For the Individual Variance Form [14], one would need to calculate each individual's variance, $V_o = p_o q_o$, where $p_o$ is simply $\sum(\text{Yes's})/n$, and $q_o = 1 - p_o$.

The Total Score Form. The speediest means of determining the reliability of a dichotomous-item test is by use of the Total Score Form which requires only the mean and variance of the total $X_t$ scores. This approximation is in fact the only means of determining the reliability in experimental situations where the analyst may not be able or may not wish to record the performance of subjects on the n test-samples but records only their final total $X_t$ scores. An example is Drosophila experiments where the flies physically traverse the n test-sample situations but ultimately wind up in test tubes corresponding to the different classes of the $X_t$ total score scale (9).

The Total Score Form for computing $r_{tt}$ derives from the Variance Form [12] in which the essential terms are $\sum V_i = \sum p_i q_i$, and $V_t$. The latter term, $V_t$, is, of course, computed directly from the distribution of total $X_t$ scores. We can approximate the $\sum p_i q_i$ term (dropping the i subscript) by the development below:

$$\sum pq = \sum (p - p^2) = \sum p - \sum p^2 = n(\bar{p}) - n(\overline{p^2}). \qquad [33]$$


From the formula for the mean of a sum, we note that

$$n(\bar{p}) = M_t. \qquad [34]$$


Since the p values of the different items are not recorded we cannot exactly compute the mean of the squared ps, i.e., $\overline{p^2}$. As an approximation we can take the mean of their squares to be the square of their mean, i.e.,

$$\overline{p^2} = \bar{p}^2, \qquad [35]$$

which will be close if the p values of the items do not vary too greatly. We can write the approximation as follows:

$$n(\overline{p^2}) = n(\bar{p})^2 = M_t^2/n. \qquad [36]$$


Substituting [34] and [36] in [33], and writing the Variance Form [12] in full,

$$r_{tt} = \frac{n}{n-1}\left(1 - \frac{M_t - M_t^2/n}{V_t}\right) \qquad [37]$$

(Total Score Form: Approx. reliability from the total $X_t$ scores only).


The Total Score Form is known as the Kuder-Richardson Formula 21 (16). The reliability determined by the use of it is a lower limit of the correct value which would be found by the Variance Form [12] or odd-even split-half. There are no “assumptions” in the derivation of the Total Score Form; it involves only the approximation of [36] and it approaches the correct value according as the item ps approach equality. Experience seems to show that a relatively wide range of p values can be tolerated without the value from [37] deviating greatly from the correct value.
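The lower-limit property can be seen in a short sketch. It compares the Total Score Form, computed from total scores alone, with the Variance Form computed from the full 0/1 matrix. The simulated items, their spread of difficulties, and the function names `kr20` and `total_score_form` are assumptions of the example.

```python
import numpy as np

def kr20(x):
    """Variance Form [12] for 0/1 test-samples (sum p q in place of sum V_i)."""
    n = x.shape[1]
    p = x.mean(axis=0)
    return n / (n - 1) * (1 - np.sum(p * (1 - p)) / x.sum(axis=1).var())

def total_score_form(totals, n):
    """Total Score Form: needs only the mean and variance of the totals."""
    m_t = totals.mean()
    v_t = totals.var()
    return n / (n - 1) * (1 - (m_t - m_t**2 / n) / v_t)

# Hypothetical data: 500 persons, 10 items whose p values differ
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
difficulty = np.linspace(-1.0, 1.0, 10)
scores = (ability + rng.normal(size=(500, 10)) > difficulty).astype(int)

exact = kr20(scores)
approx = total_score_form(scores.sum(axis=1), scores.shape[1])
print(round(exact, 4), round(approx, 4))
```

The approximate value cannot exceed the exact one, because the mean of the squared ps is never smaller than the square of their mean; the two coincide when every item has the same p.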


HISTORY OF ORTHODOXY IN
MENTAL TEST THEORY


In the preceding section we have seen that for unstratified composites all the commonly used formulas for determining the reliability coefficient (the Variance Form [12], the Part-Whole Form [13], the Individual Variance Form [14] and the Covariance Form [15]) are identities, being merely computing forms of the General Form [11]. We have also seen that the derivation of them involves no assumptions of “underlying factors,” or of statistically equivalent test-samples. They are quite simply derived on the objective principles of domain sampling. The test-samples that make up the composite $X_t$ are taken as they come, with unequal variances and covariances.


Surprising it is, therefore, that virtually all writers on mental test theory over the last half century have clung either to the orthodox theory of true and error factors, or the theory of equivalent test-samples, the first a set of unverified postulates, the second obviously unrealistic. An attempt to explain why psychologists have been unable to free themselves from these two rigid mental sets over about 50 years will not be made here. This section endeavors merely to document this orthodoxy.

We find an early, clear formulation of the truth-error factor theory by Spearman in 1910 (18). He expresses test-samples as “$x_1, x_2, \ldots = x + d_1, x + d_2$, where x is the underlying regular measurement, while the ds are the superimposed accidental components” (p. 289). From these postulates he then develops his famous formula (our Formula 17), later known as the Spearman-Brown formula. We note here also the expression “underlying” which has come down through time as a key conception of factor analysts.

In the same edition of the British 
Journal of Psychology, William 
Brown takes a more objective view of 
test-samples and derives the same 
S-B formula on the theory of equiva- 
lent test-samples (1, footnote p. 299; 
see also 2). We thus have in these 
writings the beginnings of the two 
opposing orthodoxies. 

Spearman's truth-error doctrine 
dominated thinking for a consider- 
able time thereafter, even of the op- 
ponents of Spearman's general factor 
theory. In his general statistics text 
Yule (27) adopts the truth-error 
formulation of Spearman but with 
the caution “if the further assumption
is legitimate that the errors in
$d_1$ and $d_2$ are uncorrelated with each
other” (p. 212).

In their Essentials of Mental Measurement in 1922, Brown and Thomson (3) have a real go at the Achilles heel of the truth-error postulate, namely, the belief that the alleged errors of measurement, the ds, are “accidental,” “random,” and “uncorrelated.” As they pungently put it, “When an individual [is measured by two test-samples] there is no error of observation involved. [The scores] are the actual true measures of ability on the two occasions. The average or mean ability [x] . . . is doubtless different from either, but that does not make the other two measures erroneous. [The deviations of these scores from the mean ability] represent individual variability, and to assume them uncorrelated with one another or with the mean values is to indulge in somewhat a priori reasoning” (p. 158). However, Brown and Thomson do not themselves otherwise formulate the problem.

It is Kelley in his 1924 Statistical
Methods (13) who appears first to
organize clearly and elegantly the
derivation of reliability on the as- 
sumption of equivalent test-samples. 
He begins his development by reduc- 
ing the test-samples to z scores thus 
making their variances equal; then 
further assuming their inter-rs to be 
equal, he develops all the basic 
formulas. Just as the Spearman-Yule 
formulation of the truth-error doc- 
trine sets the orthodoxy on the fac- 
torial side, so has Kelley's formula- 
tion been the bible for proponents of 
the equivalent test-sample doctrine. 

Soon after, in Holzinger's 1928 
Statistical Methods for Students of 
Education (11) we find the paradox of 
a writer's accepting both orthodoxies, 
but in different parts of his text. In 
developing the S-B formulation of the 
reliability coefficient, Holzinger fol- 
lows Kelley’s development (p. 168) 
but in deriving the standard error of 
measurement (p. 250 ff.) he strictly 
follows the Spearman-Yule truth- 
error postulates. 

In 1930, fresh from graduate 
school, the present writer wrote an 
article on the reliability coefficient 
(21) following in part the pattern of 
the truth-error conception. By 1935 
he knew better, and in a paper re- 
jecting the whole concept of factors 
as being psychologically and biologi- 
cally unrealistic dismissed the notion 
that individual variability could be 
thought of as “error” (22). He has 
not budged from this position since, 
and in place of the current vogue for 
factor analysis he has developed the 
methods of cluster analysis (23, 24, 
25) based on the same general prin- 
ciples of domain sampling as ex- 
pressed in this paper. 

The factorial conception of a test-sample score had its modern dress-up by Thurstone. In 1931 he started conventionally with a short brochure on The Reliability and Validity of Tests (19) where he formulates the problem in orthodox Spearman-Yule fashion. But by 1935 in The Vectors of Mind (20) he had fully developed the now familiar postulate of breaking up $X_i$ into several additive “underlying” multiple factors plus an uncorrelated specific. This conception is, however, a rigid adherence to the truth-error doctrine, though it contributes a new interpretation of “truth.”

A different approach to the method of computing reliability is presented in the 1937 publication of the Kuder-Richardson Formula 20 (16), which determines the reliability coefficient of a test from the variances of its dichotomous items. Confusion has been considerable over the “assumptions” of this formula because the authors derive it via the truth-error doctrine, alleging that the test-samples must measure only one common factor and also be statistically equivalent. We have seen, however, that the K-R formula is but a special case of the Variance Form [12], and hence involves no assumptions whatsoever about the factorial composition or equivalence of the test-samples.

The next landmark is the excellent mathematically based 1940 statistics text by Peters and Van Voorhis (17). They take no original approach to our problem, however, following the formulations of Kelley and thus becoming faithful followers of the equivalent test-sample camp.

The same year Jackson and Ferguson published their provocative monograph on the reliability of tests (12). On the grounds that “implicit in the reliability concept is the idea of repeated measurement” (p. 77), they seem, however, to accept the empirical correlation between comparable forms, split halves, test-retest as better measures of reliability than the correlation between a test and its comparable construct. They turn to variance analysis to separate out “errors of measurement,” true individual differences, and practice effects. In their examination of the K-R formula they trenchantly note that, as we have emphasized here, Kuder and Richardson “fell into the common error of specifying conditions that are sufficient but unnecessary” (p. 72), and that the less restrictive condition of equal covariances (our expression 9) is required (p. 75). In their treatment of the reliability of a battery (what we have called a “stratified composite”), they do, however, accept as its reliability its correlation with a comparable construct and derive from it the basic formulas for this situation. Had these writers adopted this same conception of reliability in their treatment of the reliability of an unstratified composite, their formulation of this general problem would not have deviated substantially from that of the present author.

A complete break with orthodoxy is Guttman's 1945 treatment (8) of test-retest reliability of a composite. His objective is to establish lower limits of this coefficient from the known constants of the test-samples of $X_t$. His signal finding is that our Variance Form [12], his $L_3$, is such a lower limit. However, to achieve such a limit he introduces a different set of assumptions from the usual. He says, “We use essentially only one basic assumption; that the errors of observation are independent between items [our test-samples] and between persons over the universe of trials [our $X_t$, $X_t'$, $X_t''$, . . .]. In the conventional approach, independence is taken over persons rather than trials” (p. 257). Now, “independent errors of observation” clearly implies a factorial construction of test-sample


ROBERT C. TRYON 


performance, and hence is a postulate requiring substantiation. His basic assumption, though interestingly new, is quite unnecessary, for we have seen above in our treatment of the test-retest situation that, without any assumptions, the Variance Form, or Guttman's $L_3$, is a lower limit of the test-retest coefficient.

An interesting conversion was that of Kelley in his rather monumental 1947 Fundamentals of Statistics (15). Here he deserts the objective approach which in large part he initiated in 1924, and joins the truth-error votaries. Signs of this conversion are nevertheless implicit even in his earlier 1924 book where he is preoccupied with rules for the construction of two comparable tests designed to cut down correlation between “errors” in order to arrive at the “true reliability coefficient” (p. 201 ff.). After 1924 he became a factor analyst of his own special sort (14), and thus by 1947 his concept of a test-sample is that it is either “an expression of [a] common function-plus-chance, or of this common function-plus-a-non-chance-unique-function-plus-chance” (15, p. 401). Starting with this orthodox factorial construction, he develops all the essential basic formulas we have developed earlier in this paper.

The most comprehensive attack on the problem is Gulliksen's 1950 Theory of Mental Tests (7). His formulations are, however, largely restricted to the case of dichotomous test-samples, such as true-false test items. He first presents the truth-error doctrine in orthodox form, unfortunately labeling it “The basic assumption of test theory” (Ch. 2, p. 4), and develops all the basic formulas from it. However, by Chapter 3 he sheds this “basic assumption” and for the rest of the book develops the general and specialized formula for reliability and
validity on the postulates of equivalent test-samples, called “parallel tests” by him. These are defined objectively “in terms of observable characteristics,” namely, equal means, equal sigmas, equal inter-rs (pp. 28-29). When he comes to the treatment of the Kuder-Richardson formula, he nearly breaks free, for he derives the K-R formula without the crippling factorial and other restrictive assumptions of its original authors. However, this derivation is developed for two composite tests, $X_t$ and $X_t'$ “that are parallel item for item” (p. 221), which is the special case developed earlier in this paper under “Item-parallel tests” (see our Formula 26). His derivation of the Variance Form via item-parallel test-samples is most unfortunate, for you may recall that for stratified composites one needs $\bar{c}_{ii}$, the covariance between parallel items, and $\bar{c}_{ij}$, that between nonparallel. Since an analyst rarely has available the second test, $X'$, the covariance between parallel items is unknown, so in this dilemma, Gulliksen retreats to the orthodoxy of equivalent test-samples and states, unrealistically, “the simplest and most direct assumption is that . . . the covariance between parallel items is equal to the . . . covariance between nonparallel items” (p. 223), an assumption our expression [27] rejects. Had Gulliksen approached the K-R formula, or the more general Variance Form, not for item-parallel tests but for unstratified composites, he would not have run into the $\bar{c}_{ii}$ term and probably would have seen that for the Variance Form the equivalent test-sample doctrine is an unnecessary restriction of it or of any of the basic formula. He also derives the Part-Whole Form, our [13], but to get it he employs the K-R formula as a step, and thus bounds the Part-Whole Form by the same restrictions and the dichotomous item situation he sets for the K-R formula, all quite unnecessary as we have shown.

In 1951 Cronbach's treatment (4) of the Variance Form for computing $r_{tt}$, which he calls Alpha, comprehensively explores the usefulness of this computing form. However, Cronbach does not derive the formula in a clear way that permits the reader to see what assumptions, if any, are taken to arrive at it. He shows how to interpret the Variance Form in terms of the truth-error factorial postulate (p. 312). He also states that with respect to the correlation between test-samples, it is “among items having equal variances and equal covariances” (p. 323), the equivalent test-sample postulate. One gains the impression from Cronbach that the Variance Form is a generalized formula, but we have seen that the General Form [11] is more general and that [29] is even more general. The Variance Form shares equivalent status with the Covariance, Part-Whole, and Individual Variance Forms in being another alternative computing form of [11].

We finally come to the 1954 revision of Guilford's Psychometric Methods (6). This generally excellent reference work is marred for the present writer by the strict adherence of Guilford to the truth-error factor doctrine, which he labels “The Rationale of Test Reliability” (p. 349). All the computing forms for reliability and domain validity of composites are nicely organized by Guilford, but his reader is not let in on the fact that they can be objectively derived even on the equivalent test-sample doctrine. The reader is led to believe that the truth-error doctrine is the only mental test rationale. The reader of the foregoing pages will now see that the manifold formulas in test construction, so ably arrayed by Guilford, derive quite directly from the objective principles of measurement by domain sampling, and that the factor postulates of Guilford are just one type of orthodoxy, and quite unnecessary if the reader has no predilection for “underlying” factors.


APPENDIX 


Derivation of the Individual Variance Form. The variance of any individual across the n test-samples is $V_o$, where

$$V_o = \sum X_i^2/n - \left(\sum X_i/n\right)^2.$$

The mean individual variance becomes, noting that $\sum X_i = X_t$, and averaging over the N individuals,

$$\bar{V}_o = \frac{1}{N}\sum\left[\frac{\sum X_i^2}{n}\right] - \frac{1}{N}\sum\left[\frac{X_t^2}{n^2}\right]. \qquad [38]$$

The first term of the member on the right can be found from constants of the n test-samples, i.e.,

First term $= (1/n)(\sum V_i + \sum M_i^2) = \bar{V}_i + (1/n)\sum M_i^2$.

The second term on the right $= (1/n^2)(V_t + M_t^2)$.

Substituting these two terms in [38], solving for $\bar{V}_i$, we get after a little algebra,

$$\bar{V}_i = \bar{V}_o + (1/n^2)V_t + (1/n^2)M_t^2 - (1/n)\sum M_i^2. \qquad [39]$$

Now, to find $r_{tt}$, we substitute [39] for $\bar{V}_i$ in the Variance Form in which $\sum V_i = n\bar{V}_i$, multiply numerator and denominator of the fraction by n, then after a little manipulation we finally get the Individual Variance Form [14].
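The identity relating the mean test-sample variance to the mean individual variance can be checked on arbitrary numbers. The sketch below does so for a made-up score matrix; the data and variable names are assumptions of the example, and all variances are in the divide-by-N form.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores: 50 individuals (rows) on n = 6 test-samples (columns)
X = rng.normal(loc=3.0, size=(50, 6)) + rng.normal(size=(50, 1))
n = X.shape[1]

mean_individual_var = X.var(axis=1).mean()   # each person's variance across test-samples, averaged
M_i = X.mean(axis=0)                         # test-sample means
X_t = X.sum(axis=1)                          # total scores
V_t, M_t = X_t.var(), X_t.mean()

# Mean individual variance plus the total-score and mean-square correction terms
rhs = mean_individual_var + V_t / n**2 + M_t**2 / n**2 - np.sum(M_i**2) / n
mean_item_var = X.var(axis=0).mean()         # mean test-sample variance

print(np.isclose(mean_item_var, rhs))
```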
PROPORTIONS BY SLIDE RULE

One of the statistical problems
most frequently confronting the re- 
search worker in psychology is the 
comparison of proportions or percent- 
ages of various groups manifesting a 
particular attribute. For example, 
the proportion of Ss answering ‘‘yes”’ 
to a questionnaire item in one psy- 
chiatric category might be compared 
with the proportion in another psy- 
chiatric category ; or vocational groups 
might be so compared. 

The traditional procedure has been 
to use the formula for the standard 
error of the difference between two 
observed proportions, although many 
authors of statistical texts (e.g., 2, p. 
54) add that the sampling distribu- 
tion becomes highly skewed for ex- 
treme proportions, e.g., greater than 
.90 or less than .10. This makes use 
of tables of the normal distribution 
(or t distribution) quite inaccurate in
these situations. 

However there has been available 
for several years a transformation, 
known as the inverse sine function (1, 
p. 165), which tends to satisfy the 
condition of normality of experimental 
errors and thus makes the normal dis- 
tribution tables useful for comparing 
even extreme proportions. Yet psy- 
chologists have evidently made little 
use of this function. In one of its forms, Zubin's t (3) (which is not the same as the Student-Fisher t), the function is defined as

$$t = 2 \sin^{-1} \sqrt{p},$$

in which p is the observed proportion and t is expressed in radians. The standard error of t is approximately $1/\sqrt{N}$, and for comparison of two proportions the following ratio is approximately normally distributed:

$$\frac{t_1 - t_2}{\sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}}$$

Since the sampling distribution of 
this function involves no parameters 
other than the number of observa- 
tions, confidence limits can be set at 
once for an entire list of test items 
without the bother of calculating the 
standard error for each item sepa- 
rately as in the traditional procedure. 
Table 1 enables the researcher to write down Zubin's t directly from the two frequencies (e.g., “yes” and “no”) using only a slide rule. A certain amount of accuracy is sacrificed, but this error is small relative to the sampling error unless the number of observations is quite large. The table gives interval boundaries for t intervals of .05 (which correspond to p intervals of .025 in the middle range) in terms of the corresponding ratio of the two frequencies. In the traditional procedure, a preliminary step of adding the two frequencies for use as the denominator of the proportion was necessary before division could be undertaken.
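For readers working without the table, the transformation and the comparison ratio can be computed directly. The following is a minimal sketch; the function names `zubin_t` and `difference_ratio` and the frequencies are hypothetical, chosen only to illustrate the formulas.

```python
import math

def zubin_t(yes, no):
    """Inverse sine transform of the proportion of 'yes' replies, in radians."""
    p = yes / (yes + no)
    return 2 * math.asin(math.sqrt(p))

def difference_ratio(t1, n1, t2, n2):
    """Approximately normal ratio for comparing two transformed proportions."""
    return (t1 - t2) / math.sqrt(1 / n1 + 1 / n2)

# Hypothetical frequencies for one item in two groups of 100
t_a = zubin_t(60, 40)
t_b = zubin_t(45, 55)
z = difference_ratio(t_a, 100, t_b, 100)
print(round(t_a, 3), round(t_b, 3), round(z, 2))
```

Because the standard error depends only on the numbers of observations, the denominator is computed once and reused for every item in a list.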

Researchers who often deal with a large number of observations may wish to construct a similar table with finer intervals of t. Therefore the procedure by which Table 1 was constructed for intervals of .05 will be described. In the first column of the work sheet were written the following values of t in radians: 3.125, 3.075, 3.025, 2.975, . . . , 0.125, 0.075, and 0.025. The next column contained the corresponding values of t/2 in degrees, obtained by multiplying each value in the first column by 28.648. Then the sine of each of the values in the second column was recorded in the third column (4). The fourth and fifth columns contained sin² and 1 − sin², respectively. Then the ratios for Table 1 were obtained by division. For the left column of Table 1 each sin² was divided by the corresponding 1 − sin²; for the right column, 1 − sin² by sin².

TABLE 1
ZUBIN'S t FOR RATIO INTERVALS
(Table body largely illegible; surviving ratio entries: 293.1, 143, 84, 55, 39, 29)
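The work-sheet procedure just described translates directly into a few lines of code. The sketch below reproduces the columns for any interval width; the function name `ratio_boundaries` and its `step` parameter are illustrative assumptions, and the constant 28.648 is taken from the text.

```python
import math

def ratio_boundaries(step=0.05):
    """Boundary values of t with the left- and right-column ratios."""
    rows = []
    k = int(round(math.pi / step))             # number of intervals (63 for step .05)
    for i in range(k, 0, -1):
        t = (i - 0.5) * step                   # 3.125, 3.075, ..., 0.025 for step .05
        degrees = t * 28.648                   # t/2 expressed in degrees
        sin_sq = math.sin(math.radians(degrees)) ** 2
        left = sin_sq / (1 - sin_sq)           # larger frequency is "yes"
        right = (1 - sin_sq) / sin_sq          # larger frequency is "no"
        rows.append((round(t, 3), left, right))
    return rows

table = {t: (left, right) for t, left, right in ratio_boundaries()}
print(round(table[1.775][0], 3), round(table[1.825][0], 3))
```

Passing a smaller `step`, such as .01, gives the finer table suggested for large numbers of observations.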

Table 1 has been constructed so that the larger frequency must always be used as the numerator of the ratio. The purpose here is to enable greater use of the left side of the slide rule, which is easier to read. When the positively scored category (e.g., “yes”) has the larger frequency, the left side of Table 1 should be used; when the negatively scored category (e.g., “no”) has the larger frequency, the right side should be used. A t value of .00 is equivalent to a p value of .00, and a t of 3.1416 (pi) is equivalent to a p of 1.00; however for ease of calculation it is more convenient to use 3.15 for the latter.

(Table 1, continued; surviving legible ratio entries: 4.672, 4.110, 3.632)

As an example consider the “yes” and “no” frequencies given in Table 2 for five questionnaire items. For Item 1 the ratio 1.53 falls between 1.672 and 1.509 on the left side of Table 1. This interval is represented by a t of 1.80. Similarly the ratio 2.07 falls between 2.026 and 2.255 on the right side, giving a t of 1.20. Although a separate column of ratios has been included in Table 2 for expository purposes, in practice the researcher can write down the t value for each item directly by inspection of the slide rule and Table 1. The t values to be compared can be subtracted rather easily “in one's head” if they are efficiently arranged on the data sheet. Then the absolute differences which exceed a certain value, determined by use of the standard error formulas given above, can be noted.

TABLE 2
FICTITIOUS EXAMPLE OF FIVE QUESTIONNAIRE ITEMS

Item   “Yes”   “No”   Ratio     t
  1      52     34     1.53   1.80
  2      28     58     2.07   1.20
  3      79      7    11.3    2.55
  4      43     43     1.00   1.55
  5      12     74     6.17    .75
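The table lookup in this example can be imitated by computing t from the ratio and rounding it to the nearest interval value. The sketch below does this for the five ratios of Table 2. The function name `tabled_t` is hypothetical, and the side used for Items 3 to 5 is inferred here from the t values reported in Table 2 rather than stated in the text.

```python
import math

def tabled_t(ratio, larger="yes", step=0.05):
    """Interval value of t for a ratio of larger to smaller frequency."""
    # Recover the proportion of "yes" replies from the ratio and its side
    p = ratio / (1 + ratio) if larger == "yes" else 1 / (1 + ratio)
    t = 2 * math.asin(math.sqrt(p))
    return round(round(t / step) * step, 2)    # nearest multiple of the step

# (ratio, category with the larger frequency) for Items 1 to 5
examples = [(1.53, "yes"), (2.07, "no"), (11.3, "yes"), (1.00, "yes"), (6.17, "no")]
values = [tabled_t(ratio, side) for ratio, side in examples]
print(values)
```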

After using this function for a while, the researcher will find that he begins to think directly in terms of ts; e.g., a t of 2.50 represents a very high proportion of positive replies, and a t of 1.00 represents a moderately low proportion of positive replies.
DISCRIMINABILITY SCALES FROM RATINGS¹

A variety of closely similar methods
(1, 2, 3, 5) has been employed to
scale ratings according to the prin- 
ciple of Thurstone’s Law of Com- 
parative Judgment (4), i.e., with 
variability of judgment as the unit of 
measurement. These methods suffer 
from at least one common defect: the 
unit contains a spurious or irrelevant 
component arising from differences in 
the constant rating tendencies of 
individuals. In a typical situation 
some raters will use all the scale 
categories whereas others will concen- 
trate their judgments in two or three; 
some of the latter raters will rate at 
the low end of the scale, some at the 
high end, and some in the middle. As 
a result, the dispersion of ratings for 
any given stimulus is greater than it 
would be if all the raters used the 
scale in the same way. 

It would be clearly desirable to equate all the raters with respect to such “constant error” tendencies, and use only the remaining “variable error” component as a yardstick of discriminability. This might be accomplished by transforming the ratings of each individual observer into z-scores, and then considering the distribution of such z-scores for each stimulus. To convert the ratings of each individual into percentiles or simply into ranks is computationally easier, however, and equally effective. (The fact that many tied ranks will necessarily occur in a given observer's ratings does not detract from the procedure, but in fact simplifies it.) Some distribution of ranks will then be associated with each stimulus, and an equal-discriminability scale may be constructed by applying to the rank scale a transformation which renders all such distributions as nearly as possible equal in dispersion.

¹ This report is based on work done under ARDC Project No. 7706, Task No. 27001, in support of the research and development program of the Air Force Personnel and Training Research Center, Lackland Air Force Base, Texas. Permission is granted for reproduction, translation, publication, use, and disposal in whole and in part by or for the United States Government.

The complete scaling procedure 
consists of the following operations: 

1. Ratings are converted into 
ranks for each individual observer, 
considered independently of all the 
others. This conversion amounts to 
assigning a rank to each of the in- 
dividual’s rating-scale categories: e.g., 
if the rating scale has only seven 
categories, then seven particular 
ranks will describe all the individual's 
ratings, though what these ranks are 
will depend upon the way in which 
his ratings are distributed. The formula for the rank of a particular category k, for a particular observer, is as follows:

$$R_k = \sum_{j=1}^{j=k-1} n_j + .5 n_k + .5,$$

in which $n_k$ is the number of ratings placed in category k, and the $\sum n_j$ term is the total number of ratings placed in lower categories, by this one observer.
It may be noted, in passing, that
the present step would be eliminated 
if the observers actually placed the 
stimuli in a complete rank order in- 
stead of using a rating scale, but that 
subsequent steps would be entirely 
applicable to data so obtained. 
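The rank conversion of step 1 amounts to assigning each category the average of the ranks its ratings would occupy. A minimal sketch follows; the function name `category_ranks` and the rating counts are hypothetical.

```python
def category_ranks(counts):
    """Rank assigned to each rating category for one observer.

    counts[k] is the number of ratings the observer placed in category k,
    with categories ordered from lowest to highest.
    """
    ranks = []
    below = 0                                  # ratings placed in lower categories
    for n_k in counts:
        # Unused categories receive no rank
        ranks.append(below + 0.5 * n_k + 0.5 if n_k else None)
        below += n_k
    return ranks

# Hypothetical observer: 10 stimuli rated on a four-category scale
print(category_ranks([2, 0, 5, 3]))
```

An observer who concentrates ratings in a few categories therefore receives a few heavily tied ranks, which puts all observers on the same rank scale whatever their rating habits.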

2. For each stimulus, a distribu- 
tion of the ranks assigned to that 
stimulus by the various observers is 
plotted. Measures of central tend- 
ency and dispersion are found for 
each such distribution. Median and 
interquartile range are suggested as 
preferable to mean and standard de- 
viation not only because they are 
easier to compute, but also because 
the rank distributions of high and
low stimuli typically display a marked
skewness toward the middle of the
scale. Moreover, the distributions
tend to be somewhat leptokurtic even
when they are not skew. (Perhaps
this is an overformal way of saying
that there is usually some “lunatic
fringe” of observers whose ratings
fall “outside” the distribution.) In
view of this leptokurtic tendency,
whatever its source, we have felt
safer in using the interquartile range
than a wider one, though in the case
of a normal distribution the interval
between the 7th and 93rd percentiles
is the most stable.

3. A graph is prepared on which
the reciprocal of the interquartile
range of each rank distribution is
plotted against the median of the dis-
tribution, as in Fig. 1. The reason for
using the reciprocal is that we wish to
deal with number of interquartile
range units per rank, rather than
with number of ranks per interquartile
range unit. The points may be ex-
pected to fall into a U-shaped func-
tion which indicates that extreme
stimuli are judged more precisely
than those of intermediate value. A
smooth freehand curve is drawn to fit
these points; in the empirical cases
with which we have dealt, the data
have determined such a curve with
relatively little ambiguity (see Fig.
1). The fitting process may be facili-
tated by the use of either a “rolling”
average or averages over certain
fixed intervals.
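
Steps 2 and 3 can be sketched computationally. What follows is a minimal illustration, not the hand procedure as originally carried out: the function names, the sample ranks, and the centered three-point window are assumptions made for the example, and the "inclusive" quartile rule of Python's `statistics.quantiles` stands in for whatever quartile convention the reader prefers.

```python
from statistics import median, quantiles


def reciprocal_iqr_points(ranks_by_stimulus):
    """Return sorted (median rank, 1 / interquartile range) pairs, one per stimulus."""
    points = []
    for ranks in ranks_by_stimulus:
        q1, _, q3 = quantiles(ranks, n=4, method="inclusive")
        iqr = q3 - q1
        if iqr <= 0:
            raise ValueError("interquartile range must be positive")
        points.append((median(ranks), 1.0 / iqr))
    return sorted(points)


def rolling_average(values, window=3):
    """Centered rolling average; the window simply shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical ranks assigned to three stimuli by five observers each.
ranks = [[1, 2, 3, 4, 5], [4, 6, 8, 10, 12], [11, 12, 13, 14, 15]]
points = reciprocal_iqr_points(ranks)
smoothed_ordinates = rolling_average([y for _, y in points])
```

With these made-up ranks the middle stimulus has the widest interquartile range and therefore the smallest reciprocal, which is the U-shaped pattern described in step 3.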


FIG. 1. RECIPROCAL OF INTERQUARTILE RANGE
OF STIMULUS RANK DISTRIBUTION AS A
FUNCTION OF THE MEDIAN OF
THE DISTRIBUTION


4. Next the integral or cumulative 
area of this curve above some arbi- 
trary origin is obtained. A table giv- 
ing the cumulative area of the curve 
for each rank may be compiled with 
no great effort by adding successive 
ordinates of the curve, cumulatively, 
on a desk calculator. The values so 
obtained will be approximate, but 
highly accurate. 

Suppose, for example, that the low- 
est stimulus has a median rank of 8. 
We may start integrating at this 
point, and assign rank 8 a cumulative
area value of zero. The value of the 
integral for rank 9 will be equal to the 
ordinate (read from the graph) at 8.5. 
The value for rank 10 will be equal to 
the ordinate at 8.5 plus the ordinate 
at 9.5, and so on. 
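
The summation in step 4 can be written out as a short sketch. The curve used here is a hypothetical stand-in for the freehand curve of Fig. 1, and the function names are assumptions; only the rule of adding the ordinate read at each half-rank comes from the text.

```python
def cumulative_area_table(curve, lowest_rank, highest_rank):
    """Map each rank to the sum of midpoint ordinates between it and the origin."""
    table = {lowest_rank: 0.0}  # the origin is placed at the lowest median rank
    total = 0.0
    for rank in range(lowest_rank, highest_rank):
        total += curve(rank + 0.5)  # ordinate read at the midpoint, e.g. 8.5
        table[rank + 1] = total
    return table


def example_curve(rank):
    """Hypothetical smooth U-shaped curve (not the data of Fig. 1)."""
    return 0.2 + 0.001 * (rank - 50.0) ** 2


table = cumulative_area_table(example_curve, 8, 12)
```

Each entry differs from the one before it by a single ordinate, so the table is in effect a midpoint-rule approximation to the integral of the curve.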

5. This table enables us finally to 
transform the median ranks of the 
stimuli into values on an equal-dis-
criminability scale. Each of the slices
added in the integration process rep-
resents the (fractional) number of
interquartile range units between one
rank and the next. Therefore the sum 
of these slices, up to a given rank, 
represents the number of interquartile 
range units between a stimulus with 
that median rank and the origin 
(which may, as suggested above, be 
placed at the lowest stimulus). 

Interquartile range values may be 
transformed into equivalent standard 
deviation values simply by multiply- 
ing through by the constant 1.349, 
which is the ratio of these two meas- 
ures for the normal distribution. For 
most purposes the values might as 
well be left in terms of interquartile 
range. In Fig. 2 the cumulative area 
under the function in Fig. 1 is 
graphed, with scale values expressed 
in both units on the ordinate.

FIG. 2. SCALE VALUES CORRESPONDING TO
MEDIAN RANKS OF STIMULI
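
The transformation of step 5, and the conversion to standard deviation units, may be illustrated as follows. This is a sketch under stated assumptions: the small table is invented, and the linear interpolation used for median ranks that fall between whole ranks is our own choice, since the text does not say how fractional ranks were read from the table.

```python
import math

IQR_PER_SD = 1.349  # interquartile range of a normal distribution, in SD units


def scale_value(table, median_rank):
    """Read a cumulative-area table at a median rank (linear interpolation assumed)."""
    lower = math.floor(median_rank)
    if lower == median_rank:
        return table[lower]
    fraction = median_rank - lower
    return table[lower] + fraction * (table[lower + 1] - table[lower])


def to_sigma_units(value_in_iqr_units):
    """Convert a scale value from interquartile-range units to SD units."""
    return value_in_iqr_units * IQR_PER_SD


# Hypothetical cumulative-area table with its origin at rank 8.
example_table = {8: 0.0, 9: 2.0, 10: 3.0}
value_iqr = scale_value(example_table, 8.5)
value_sigma = to_sigma_units(value_iqr)
```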


In an experiment which will be re- 
ported elsewhere, 114 pairs of poly- 
gons were rated on a similarity-dif- 
ference scale by 140 observers. Each 
pair was treated as a “stimulus,” for
scaling purposes. The ratings were
scaled by both the graded-dichot-
omies method (1) and the present
ranking method. (Figs. 1 and 2 are 
based on data from this experiment.) 
The whole rating and scaling proce- 
dure was then replicated with new 
observers. In the original experiment, 
scale values obtained for the stimuli 
by the graded-dichotomies method 
covered a range of 5.5 sigma units, 
whereas the ranking method yielded 
a range of 6.7 sigma units. In the 
replication, the graded-dichotomies 
method gave a 4.9 sigma spread, and
the ranking method a 5.7 sigma
spread.

The correlation between original 
and replication was .988 for the 
graded-dichotomies values, and .992 
for values obtained by the ranking 
method. These figures imply that the 
amount of experimental-and-scaling 
error in the graded-dichotomies val-
ues was about 50% greater than that
in the ranking method values (pre- 
sumably most of the error was associ- 
ated with random but genuine differ- 
ences of judgment between the two 
samples of observers, rather than 
with scaling technique). It was fur- 
ther found that scale values obtained 
by the present method yielded higher 
correlations with certain physical
measures on the stimulus material
than did graded-dichotomies values.
In short, the new method has been
supported by all the empirical tests
which have thus far been applied to
it. Its primary justification should
nevertheless remain the rational ar-
gument presented earlier.
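
The figure of about 50% quoted above can be checked with a line of arithmetic. The sketch assumes, as the text does not state explicitly, that each original-replication correlation is treated as a reliability coefficient, so that 1 - r estimates the proportion of error; the function name is ours.

```python
def error_ratio(r_first, r_second):
    """Ratio of error proportions, taking 1 - r as the error proportion."""
    return (1.0 - r_first) / (1.0 - r_second)


# Correlations reported for the graded-dichotomies and ranking methods.
ratio = error_ratio(0.988, 0.992)
```

The ratio comes to about 1.5, that is, roughly 50% more error in the graded-dichotomies values.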


LEARNING CURVES—FACTS OR ARTIFACTS?¹
HARRY P. BAHRICK
Ohio Wesleyan University
The Ohio State University


From an operational viewpoint, 
theory and method are intimately re- 
lated. A theory has little value if no 
method is available for testing its 
predictions, and a method has no 
validity if it is not representative of 
the operations specified by some 
theory. Yet there are many instances 
in learning research in which behavior 
measures are chosen as a matter of 
convenience, rather than on theoreti- 
cal grounds. This is done despite the 
fact that most theories of learning 
emphasize the distinction between 
the basic process of learning and indi- 
cants of this process (12, 16, 17) and 
despite the fact that the importance 
of this distinction is demonstrated in 
studies reporting low correlations 
among various indicants of presum- 
ably the same learning process (5; 10, 
p. 138). 

One of the most common instances 
of the arbitrary choice of response 
measures in learning studies is the use 
of a dichotomous score as an indicant 
of an underlying process which is 
known or assumed to be continuously 
distributed. We have reference to 


1 This research was conducted in the Labo- 
ratory of Aviation Psychology of The Ohio 
State University and was supported in part 
by the United States Air Force under Con- 
tract No. AF 41(657)-70 with the OSU Re- 
search Foundation, monitored by Dr. Robert 
L. French, who served as Scientific Officer for 
the AF Personnel and Training Research 
Center. Permission is granted for reproduc- 
tion, publication, use, and disposal in whole 
or in part by or for the United States Govern- 
ment. 

* The authors are indebted to Dr. David 
Bakan for many constructive criticisms.

the use of arbitrary criteria of success
and failure, or arbitrary criteria of 
the occurrence of a response, such as 
the extent of entry into a cul-de-sac 
necessary for recording an error in 
maze learning, the magnitude and la- 
tency of a reaction necessary for re- 
cording the occurrence of a condi- 
tioned response, or the size of the 
target used in determining the num- 
ber of hits in a skilled task. A con- 
tinuity viewpoint holds that the proc- 
esses underlying these phenomena
will produce a continuous and often
normal distribution of response meas- 
ures. 

It is our purpose in this paper to 
show that the arbitrary choice of a
cutoff point in the dichotomizing of
continuous response distributions can 
impose significant constraints upon 
the shape of resulting learning curves, 
and that this can form the basis of 
misleading theoretical interpretations.
We have chosen for illustration of 
this point the use of time-on-target 
scores as indicants of the level of skill 
attained in tracking tasks. However, 
we believe that the principles de- 
veloped are quite general and apply 
to many learning situations.

Time-on-target scores reflect the
amount of time during a trial that S 
is able to remain within an arbitrarily 
specified region around a target. A 
great many reports have been pub- 
lished during the last few years in 
which E has made use of such scores. 
In a few studies the effects of target 
size upon transfer of training have 
been examined (3, 8). Two studies
(7, 14) have dealt with the validity of
time-on-target scores, and one of
these (7) concluded that their useful-
ness is limited because the correla-
tions between such scores and aver-
age error scores vary as a function
of target size and problem difficulty. 
However, emphasis has not been 
placed on the importance of the con- 
straints which the choice of a scoring 
zone exercises upon the shape of 
learning curves plotted from the re- 
corded data and upon the conclusions 
which can be drawn from these func- 
tions. 

It is our purpose here to show that 
the same tracking behavior, when 
scored with different target-tolerance 
standards, will result in learning 
curves which differ greatly in shape, 
and that the differing shape of learn- 
ing curves obtained with various- 
sized scoring zones can be predicted 
theoretically from assumptions re- 
garding the error-amplitude distribu- 
tions. In further support of this view 
we present empirical data which indi- 
cate, in the case of tracking behavior, 
that the underlying error distribution 
to which all conventional scores can 
be related is continuously and nor- 
mally distributed. Finally, we point 
out that a lack of understanding of 
these differential characteristics of 
response measures can easily lead to 
incorrect conclusions regarding the 
effects of other variables. 


EMPIRICAL DATA 


The data reported here are taken 
from two studies (1, 5) in which Ss
practiced one-dimensional tracking
tasks on an electronic tracking ap- 
paratus described elsewhere (18). 
The tracking problems in the two 
studies varied in difficulty. We shall 
present first the learning curves ob- 
tained for the more difficult problem 
(5). In this study the target motion
consisted of a 10 cpm sinusoidal mo-
tion of a line on a cathode-ray-tube
(CRT) display. A filter with a time 
constant of .4 sec. introduced an ex- 
ponential lag between the output of 
S's arm control and its effect on 
tracking error. A compensatory dis- 
play was used which provided a tar- 
get line that remained stationary in 
the center of the display, and a cursor 
that moved to the right or left de- 
pending on the direction of the error 
from moment to moment. Two types 
of performance measures were taken 
on even-numbered 90-sec. trials: RMS 
error scores and time-on-target scores. 

An electronic circuit provided a 
means of continuously obtaining the 
magnitude of the error (in the form 
of an electrical voltage), squaring this 
voltage, and integrating it over the 
period of a trial. The output of this 
circuit appeared on a voltmeter and 
the square root of this meter reading 
provided an index of the root mean 
squared error (RMS). The error volt- 
age is computed with respect to an 
absolute reference of zero volts, 
whereas S's error amplitude distribu- 
tion may show some constant bias 
toward plus or minus voltages (i.e., 
error to the right or left of the target). 
As a result, the error RMS reflects 
both the variability of S's distribu- 
tion of amplitudes and any small con- 
stant error in average cursor position. 
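
The point that an RMS taken about zero volts reflects both variability and constant error can be made concrete with sampled data. The sketch below is illustrative only: the sample voltages are invented, and the original measure was obtained by an analog circuit rather than from discrete samples.

```python
import math


def rms_about_zero(errors):
    """Root mean squared error with respect to an absolute zero reference."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mean_and_sd(errors):
    """Mean (constant error) and SD (variability) of the same samples."""
    mean = sum(errors) / len(errors)
    variance = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(variance)


# Hypothetical error voltages with a small bias toward one side of the target.
errors = [0.3, -0.1, 0.5, 0.1, 0.2]
rms = rms_about_zero(errors)
mean, sd = mean_and_sd(errors)
```

For any set of samples the squared RMS about zero equals the squared SD plus the squared mean, so a constant bias inflates the RMS even when variability is unchanged.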

Time-on-target measures give the 
total time that the absolute magni- 
tude of the error voltage was smaller 
than a given magnitude. Three such 
scores were taken for target zones of
5%, 15%, and 30% of ±5 v., which
was the maximum problem voltage.
These zones correspond to errors of
.1, .3, and .6 in. of displacement of
the cursor to either side of the target
line, respectively. The three zones
(from smallest to largest) will be re-
ferred to hereafter as zones A, B, and
C in order to avoid confusion when it
is desired later to refer to the percent-
age of time S was “on target.” Simi-
larly, the time-on-target scores will
be referred to as scores A, B, and C.
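
A time-on-target score of this kind can be sketched for sampled error voltages as the proportion of samples falling inside the zone. Several details are assumptions for illustration: the error samples are invented, equal sampling intervals are presumed so that a proportion of samples stands for a proportion of time, and the 5% figure for zone A is inferred from the ratio of the quoted cursor displacements.

```python
MAX_VOLTAGE = 5.0  # maximum problem voltage, in volts

# Zone half-widths in volts; the 5% value for zone A is an inferred assumption.
ZONES = {"A": 0.05 * MAX_VOLTAGE, "B": 0.15 * MAX_VOLTAGE, "C": 0.30 * MAX_VOLTAGE}


def time_on_target(errors, half_width):
    """Fraction of samples whose absolute error is smaller than half_width."""
    hits = sum(1 for e in errors if abs(e) < half_width)
    return hits / len(errors)


# Hypothetical error voltages sampled at equal intervals during one trial.
errors = [0.1, -0.2, 0.5, -0.9, 1.2, -1.6, 0.0, 0.7]
scores = {name: time_on_target(errors, width) for name, width in ZONES.items()}
```

All three scores are dichotomized readings of one and the same error record; only the cutoff differs.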

Fifty male and 50 female Ss were 
used, and since the male Ss were su- 
perior trackers on the average, sep- 
arate learning curves are presented 
for the two groups. These curves are 
the empirical curves shown in Fig. 1 
and Fig. 2 for the RMS scores and 
the three time-on-target scores. 

The various learning curves sug-
gest different accounts of the relative
and absolute improvement during 
practice. The curves for time-on-tar- 
get scores all suggest that absolute as 
well as relative improvements during 
tracking were greater for the male 
than for the female Ss. This effect is 
particularly apparent for the smallest 
target zone (zone A) and becomes 
progressively less pronounced for the 
larger target zones. Between trials 
2 and 14 the males improved by 
33.2%, 31.9%, and 18.7% for scores 
in zones A, B, and C, respectively,
while the corresponding improve-
ments for the females are only 2.5%,
17.6%, and 11.8%. The RMS
curves, however, indicate a greater
improvement for the females, with a
22.3% reduction of error as con-
trasted with a 20.4% error reduction
for the males. And all of these scores,
it should be remembered, are derived
from a single error voltage!

The widely divergent picture of the
amount of improvement resulting
from practice, given by the four
scores described above, can be ac-
counted for on theoretical grounds,
which will be developed in the next
section. Briefly, it will be shown that
time-on-target scores are nonlinear,
being relatively insensitive to changes
in level of performance both above 
and below a critical region. Most of 
the female Ss who served in the study 
referred to above, for example, were 
relatively poor trackers at the begin- 
ning of practice, and the zone A time- 
on-target score was almost completely 
insensitive to any improvement at 
this level of skill. Improvements at 
this level, however, were reflected in 
fairly large reductions in the error 
RMS score. 

We shall now present data from the 
second study (1) which illustrate the 
same kind of scoring artifact with a 
less difficult tracking task. As in the 
first study, the problem was that of 
compensating for a 10 cpm target
oscillation. However, no lag was in- 
troduced between the control output 
and the cursor movement. Several 
control loading conditions were used 
in this study as independent varia- 
bles. We have selected four learning 
curves from the condition in which 
both a spring and a mass were used
to load the control, since this condi- 
tion of the study generally resulted 
in the best performance. Twenty-five 
males served as Ss. The mean learn- 
ing curves obtained for the RMS 
score and for three time-on-target 
scores employing the same relative 
target zones as were used in the previ- 
ous study are shown as the empirical 
curves in Fig. 3. 

It can be seen that the empirical 
curves in Fig. 3 again give different 
accounts of the improvement in per- 
formance at different stages of prac- 
tice. The zone C curve is negatively 
accelerated and shows most of the im- 
provement during the early trials, 
with smaller improvement during the 
last few trials. The curve for zone A, 
on the other hand, shows the largest 
gain during the last two trials, and 
relatively less gain during the early
trials. In comparing the empirical
curves of Fig. 1 and Fig. 3 the most 
striking contrast is furnished by the 
slope of the zone A curves. In Fig. 3 
we can note a great deal of improve- 
ment of zone A scores, and these 
scores seem to provide a very sensi- 
tive index of learning. In Fig. 1 much 
less gain is registered for the same 
zone A scores. Particularly for female 
Ss in Fig. 1, this score reveals scarcely 
any improvement, despite the fact 
that the reduction of the RMS error 
is as large as for the data shown in 
Fig. 2. 

It can be shown that the differ- 
ential sensitivity of the scores in 
these two studies is determined by 
the variation in task difficulty. The 
change of sensitivity of individual 
scoring zones as a function of task dif- 
ficulty has been mentioned earlier, 
but will now be dealt with more sys- 
tematically. 


PREDICTION OF LEARNING CURVES 
FOR VARIOUS TIME-ON-TARGET 
ZONES 

If we assume that the amplitudes of 
tracking errors form a normal dis- 
tribution during a trial, it is apparent 
that the percentage of this distribu- 
tion which would fall within a given 
target zone can be determined, pro- 
vided the standard deviation of the 
distribution of tracking errors is 
known. To illustrate the differential 
sensitivity of various scoring zones, 
we show in Fig. 4 predicted time-on- 
target scores for five target zones of 
differing size as a function of the 
magnitude of the RMS value of the 
error distribution. 

Successive values for these curves 
are found by determining the ratio 
of the scoring zone, in volts, to the 
RMS values of the error distribution, 
also in volts. The ratios are z scores 
and the percentage of a normal dis-
tribution between zero and each z
score is found from a table of the nor- 
mal curve. These values are multi- 
plied by two to include errors on both 
sides of the target, and are plotted on 
the ordinate opposite the assumed 
RMS value. 
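
The table look-up just described is equivalent to evaluating the normal integral directly. The sketch below assumes a normal error distribution centered on the target, so that the RMS value equals the SD; the function name and the sample values are ours.

```python
import math


def predicted_time_on_target(zone_volts, rms_volts):
    """Proportion of a zero-mean normal error distribution within +/- zone_volts."""
    z = zone_volts / rms_volts  # ratio of scoring zone to RMS, i.e. a z score
    # Twice the area between zero and z equals erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2.0))


# One predicted curve: a fixed zone of 0.75 v. over decreasing RMS values.
rms_values = [2.0, 1.5, 1.0, 0.5]
curve = [predicted_time_on_target(0.75, rms) for rms in rms_values]
```

A zone equal to one RMS unit gives a predicted score of about 68%, and the predicted score rises as the RMS value falls.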

It can be seen that each of the 
curves in Fig. 4 shows a maximal 
slope at a different range of variation 
of the RMS value, and becomes in- 
sensitive to variations outside that 
range. The ranges of maximal sensi- 
tivity shift toward smaller RMS val- 
ues as we move from larger to smaller 
target zones. The sensitivity of a 
time-on-target score is maximal when 
the zone is of a size that includes ±1
SD of the error distribution, so that S 
is on target about 68% of the time. 
For smaller or larger target zones the 
score becomes progressively less sen- 
sitive to changes in the RMS value of 
the error distribution. 
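
A brief derivation indicates where the figure of 1 SD comes from. It is a sketch under assumptions not spelled out in the text: the error distribution is taken as normal with zero mean and standard deviation \sigma, the zone has half-width a, and zones of different size are compared at one fixed value of \sigma. Here \Phi and \phi denote the standard normal distribution function and density.

```latex
\begin{align*}
P(a,\sigma) &= 2\,\Phi(z) - 1, \qquad z = a/\sigma, \\
\left|\frac{\partial P}{\partial \sigma}\right|
  &= \frac{2a}{\sigma^{2}}\,\phi(z) = \frac{2}{\sigma}\, z\,\phi(z), \\
\frac{d}{dz}\bigl[z\,\phi(z)\bigr] &= (1 - z^{2})\,\phi(z) = 0
  \;\Longrightarrow\; z = 1, \\
P\big|_{z=1} &= 2\,\Phi(1) - 1 \approx 0.683 .
\end{align*}
```

If instead the zone is held fixed and \sigma is varied, the quantity to be maximized becomes z^{2}\phi(z), whose maximum lies at z = \sqrt{2}; the exact location of greatest sensitivity therefore depends on which comparison is intended.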

Functions similar to those shown 
in Fig. 4 can be plotted for target 
zones of any desired size, and it is ap- 
parent that curves for very small tar- 
get zones would show their maximal 
sensitivity in an RMS range in which 
the curves of larger target zones have 
already approached an asymptote. 

It is obvious that a score cannot re-
veal improvements once performance
is approaching an asymptote of 100%
time on target. However, the relative 
lack of sensitivity of each score at low 
performance levels is not generally 
recognized. 

Empirical learning curves can be 
expected to depart from the curves 
shown in Fig. 4 for at least two rea- 
sons: (a) the theoretical curves in 
Fig. 4 are plotted for linear decreases 
in error RMS, while the observed de- 
creases of error RMS during practice 
are usually a negatively accelerating 
function; and (b) the theoretical
curves are also based upon the as- 
sumption that the amplitude distri- 
bution of error is normally distrib- 
uted on all trials. In order to assess 
the significance of departures from 
normality in the data reported in Fig. 
1 and Fig. 3, we have plotted curves 
for all three target zones using the 
observed RMS values for each trial 
and the corresponding ordinate val-
ues from the theoretical curves in Fig.
4. The divergence of the predicted 
curves from the corresponding em- 
pirical ones can be attributed in large 
measure to departures of the error 
distributions from normality. Unre- 
liability of the electronic scoring 
equipment would, of course, also con- 
tribute to such divergence, but is be- 
lieved to be quite small in the present 
case. 

It can be seen that the empirical 
curves in Fig. 3 correspond moder- 
ately well to those predicted from the 
assumption of a normal distribution 
of error amplitudes. In Fig. 1 the 
correspondence of empirical and pre- 
dicted curves is close in the case of 
the zone C curves. For the zones A 
and B curves, male Ss performed bet- 
ter than would be predicted on the as- 
sumption of normality, and for the 
zone A curves this divergence is quite 
pronounced, particularly during the 
last few trials of practice. On the
basis of our analysis we would expect
these relations only if the amplitude 
distribution of tracking error were 
more peaked than a normal distribu- 
tion, i.e., if the area near the center of 
the distribution were greater than 
predicted from the z scores. In order 
to check this prediction we shall need 
to examine in detail the empirical 
error-amplitude distributions of Ss 
during tracking. This we proceed to 
do in the next section.


EMPIRICAL ERROR-AMPLITUDE
DISTRIBUTIONS

In Fig. 5 we present empirical dis- 
tributions of the error amplitude on 
trials 2, 6, and 14 for the data re- 
ported in Fig. 1. These distributions 
were obtained by means of an error- 
amplitude analyzer, an apparatus 
that has been described in more de- 
tail elsewhere (5). Also shown in this 
figure are normal curves with the 
same mean and SD as those of the
empirical distributions. Inspection
of Fig. 5 confirms our theoretical pre- 
diction from the data of Fig. 4. It can 
be seen that the error distributions of 
female Ss do not depart greatly from 
normality, and neither do the pre- 
dicted values of their learning curves 
(Fig. 1) vary greatly from the ob- 
served ones. For male Ss, however, 
the obtained distributions are more 
peaked than the corresponding nor- 
mal ones, particularly on later trials. 
This confirms our interpretation of 
the departure of the empirical learn- 
ing curve from the predicted curve for 
zone A scores (in Fig. 1). 

Because the data of Fig. 5 are 
pooled for 50 Ss, the peaking of the 
combined error-amplitude distribu-
tion may be the result of at least two
different conditions. It is possible 
that this type of departure from nor- 
mality characterizes the individual 
error distributions of the majority of 
Ss, or it may be that all or most indi- 
viduals show normal error distribu- 
tions, but that we have a nonnormal 
distribution of individual differences 
among our 50 male Ss. In other 
words, the combining of 50 normal 
distributions with different SDs can 
yield a curve such as we obtained, 
provided sizable proportions of these
curves (i.e., individuals) represent
unusually good and unusually poor 
performance. 
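
That pooling normal distributions of unequal SD yields a peaked curve can be verified from the moments of the pooled distribution. The sketch assumes zero-mean normal errors for every S and equal amounts of data per S; the SD values are invented. Kurtosis is expressed so that a normal distribution gives 3.

```python
def pooled_kurtosis(sds):
    """Kurtosis (normal = 3) of an equal-weight pool of zero-mean normal curves."""
    n = len(sds)
    second_moment = sum(s ** 2 for s in sds) / n
    fourth_moment = sum(3.0 * s ** 4 for s in sds) / n  # E[x^4] = 3 s^4 for a normal
    return fourth_moment / second_moment ** 2


# Fifty hypothetical Ss: identical SDs versus a split into good and poor trackers.
uniform_group = pooled_kurtosis([1.0] * 50)
mixed_group = pooled_kurtosis([0.5] * 25 + [1.5] * 25)
```

When every S has the same SD the pooled curve is normal; once the SDs differ, the pooled kurtosis exceeds 3, which corresponds to the peaking seen in the group distributions.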

To determine which of these two 
situations prevailed we analyzed in 
more detail the data for the 50 males 
on trial 14, since this distribution 
(Fig. 5) shows the most pronounced 
departure from normality. The error 
amplitude distributions of each of the 
individual Ss on this trial were con- 
verted into z scores after the SD of 
each S's own amplitude distribution 
was determined. The ordinates for 
successive z-score values of .1 were 
averaged for the 50 Ss and the result- 
ing distribution is shown in Fig. 6, 
together with a normal distribution. 
It can be seen that the peaking effect 
is not due to departures from nor- 
mality in the error-amplitude distri- 
butions of individual Ss, but rather 
that it is due to the combining of nor- 
mal distributions which among them- 
selves are not normally distributed. 
We are apparently dealing with a 
situation in which individual differ- 
ences are normally distributed early, 
but not later in learning. 

The problems involved in inter- 
preting learning curves based on 
group data have been the concern of 
several recent papers (2, 4, 15). In 
the present instance, however, we are 
chiefly interested in accounting for 
the departures of the obtained curves 
from the predicted curves in Fig. 1.
The progressive peakings of the group
amplitude distributions for male Ss
appear to be a sufficient explanation
for this phenomenon. During the
later stages of practice the change in
the shape of the pooled amplitude
distribution has made the zone A
curve more sensitive and the zone C
curve less sensitive than would be the
case if the group error-amplitude dis-
tribution had remained normal.


CONFOUNDING OF EFFECTS PRO-
DUCED BY THE MANIPULATION
OF INDEPENDENT VARIABLES
AND ARTIFACTS PRODUCED
BY SCORING

We have shown in the preceding 
section that learning curves based 
upon a particular-sized time-on-tar- 
get zone will be maximally sensitive 
over a relatively narrow range of 
RMS values, and be relatively in- 
sensitive to variations of RMS out- 
side that range. This means that 
large differences in the RMS value of 
the error-amplitude distribution may 
exist and may result in small or large 
differences of performance on a time- 
on-target score, depending upon the 
sensitivity of the score over the RMS 
range in question. If we now use a 
time-on-target score as a means of de- 
termining the functional relation be- 
tween two variables, functions at var- 
ious stages of learning, or functions 
for several versions of a task which
differ in difficulty, the change in
sensitivity of our measure must be
taken into account. Particularly
must we guard against scoring arti- 
facts when we look for interaction ef- 
fects. Otherwise we may conclude 
that the independent variable pro- 
duces important effects at one stage 
of training and not at other stages, or 
important effects on a simple-task 
version and not on a difficult-task 
version, when in fact we are dealing 
with artifacts produced by the non- 
linearity of our measures. 

There are many instances in the 
literature where authors have failed 
to consider the above effects in their 
interpretation of results based on 
time-on-target scores. In order to 
call attention to this problem we have 
chosen two reports of work done in 
our own laboratory. 

In a recent paper (9) which evalu-
ates the effect of stimulus and re-
sponse amplitudes upon tracking per-
formance, a 5% time-on-target score,
covering an error voltage range of
±.325 v., was used. In discussing
the interactions of stimulus and re- 
sponse amplitude upon performance, 
the authors conclude from statistical 
tests of their scores that as the stimu- 
lus amplitude was magnified, Ss 
found it increasingly advantageous to 
use a large response motion. This 
conclusion appears quite reasonable 
if we examine in their Fig. 4 (9, p. 
86) the progressive separation among 
response-amplitude curves for in- 
creasing values of stimulus amplitude. 
However, the small stimulus ampli- 
tudes resulted in scores of only 20% 
to 30% time on target, while time-on- 
target scores as high as 50% to 60% 
were achieved under conditions of 
large stimulus amplitude. If we now 
refer to Fig. 4 of the present article it 
can be seen that the acquisition curve 
for a time-on-target zone that pro- 
vides scores near 50% on target is 
very sensitive to variations of RMS, 
whereas a time-on-target zone giving 
scores in the range near 20% is rela- 
tively insensitive to identical varia- 
tions in RMS error. Thus, assuming 
a normal distribution of error ampli- 
tudes, slight variations in the RMS 
value of the error would produce 
large effects on the time-on-target 
score if the stimulus amplitude is 
large, but only small effects when the 
stimulus amplitude is small. The 
range of RMS values occurring in this 
study varied from about .4 to 1 v. If 
a larger target zone had been used, for 
example, one covering ±.75 v., it is
possible that the statistical analyses 
would again have shown a significant 
interaction, but the obtained curves 
would have shown more separation 
for small than for large stimulus 
amplitudes and the authors would 
have been forced to make an opposite
conclusion regarding the direction of
the interaction effect among stimulus 
and response amplitudes. 
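
The differential sensitivity described in this example can be illustrated numerically. The sketch assumes a zero-mean normal error distribution; the zone of .325 v. and the RMS range of about .4 to 1 v. are taken from the passage above, while the RMS decrement of .05 v. and the function names are our own choices.

```python
import math


def predicted_time_on_target(zone, rms):
    """Proportion of a zero-mean normal error distribution within +/- zone."""
    return math.erf(zone / (rms * math.sqrt(2.0)))


def score_gain(zone, rms_before, rms_after):
    """Change in predicted score produced by a given change in RMS error."""
    return (predicted_time_on_target(zone, rms_after)
            - predicted_time_on_target(zone, rms_before))


ZONE = 0.325  # half-width in volts of the scoring zone discussed above

score_at_high_rms = predicted_time_on_target(ZONE, 1.0)
score_at_low_rms = predicted_time_on_target(ZONE, 0.4)

# The same 0.05 v. reduction in RMS error, applied at each end of the range.
gain_at_high_rms = score_gain(ZONE, 1.00, 0.95)
gain_at_low_rms = score_gain(ZONE, 0.45, 0.40)
```

Under these assumptions the predicted scores at the two ends of the RMS range fall within the 20% to 30% and 50% to 60% bands mentioned in the text, and an identical reduction in RMS error produces a gain several times larger at the low-RMS end than at the high-RMS end.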

In this same study the absence of 
significant stimulus- and response- 
amplitude effects upon performance 
in the compensatory version of the 
task may have arisen because of a 
similar artifact of scoring. It can be 
seen in Fig. 4 (9, p. 86) that time-on- 
target scores under the compensatory 
30-plus-20-plus-10-cpm-frequency 
condition did not greatly exceed 10%. 
Inspection of Fig. 4 in the present ar- 
ticle shows that the curve for a target 
zone giving 10% time-on-target scores 
has extremely poor sensitivity in 
terms of the RMS criterion. Using 
their particular target zone for this 
version of their task, it would be diffi- 
cult to demonstrate the effect of any 
independent variable upon perform- 
ance. We do not believe, therefore, 
that a comparison of the relative ef- 
fect of amplitude factors on compen- 
satory vs. pursuit versions of the 
tracking task is possible on this basis. 
Thus, whereas the authors were care- 
ful to use the same criterion measure 
(a particular time-on-target zone) for 
all of their task variations, the very 
use of a standard measure, which was 
differentially sensitive to tasks of 
varying difficulty, limits the validity 
of some of their conclusions. This 
should, however, not be interpreted 
as a criticism of the major findings of 
the study which do not depend upon 
assumptions of linearity. 

In another study (12) evaluating 
the effects of control loading upon 
tracking performance, a similar effect 
can be observed. In summarizing 
their results the authors conclude 
that the differential effects of control 
loading upon performance seem to in- 
crease during the first 20 practice ses- 
sions. We have reproduced their Fig. 
2 (12, p. 355) as Fig. 7 of the present
report. Inspection of Fig. 7 indicates
that the various curves show increas- 
ing separation as practice progresses, 
and this fact forms the basis for the 
above conclusion of the authors. 
However, the initial performance 
level is only about 12% time on tar- 
get, a range in which time-on-target 
scores are insensitive, while the 
terminal performance level is one 
which brings the scores into a much 
more sensitive range of the perform- 
ance measure. To illustrate that this 
increasing separation among learning 
curves is an artifact of the gradually 
increasing sensitivity of the scoring 
zone, the curves have been replotted 
in Fig. 8 for a larger scoring zone. 

The conversion of scores is based 
upon the functions shown in Fig. 4, 
and involves the assumption of a nor- 
mal distribution of error amplitudes. 
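
A conversion of this kind may be sketched as follows. It is an illustration under the stated normality assumption, not the exact procedure used for Fig. 8: the zone sizes are hypothetical, since the text does not give them, and the function names are ours. `statistics.NormalDist` (Python 3.8 or later) supplies the normal integral and its inverse.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def rms_from_score(score, zone):
    """RMS error implied by a time-on-target score (0 < score < 1) for a given zone."""
    z = _STANDARD_NORMAL.inv_cdf((1.0 + score) / 2.0)
    return zone / z


def rescore(score, old_zone, new_zone):
    """Predict the score the same performance would earn with a different zone."""
    rms = rms_from_score(score, old_zone)
    return 2.0 * _STANDARD_NORMAL.cdf(new_zone / rms) - 1.0


# Hypothetical zones: rescoring a 12% score with a zone three times as wide.
rescored = rescore(0.12, old_zone=1.0, new_zone=3.0)
```

The conversion first recovers the RMS value implied by the original score and then reads off the score expected for the larger zone, so it inherits any departure of the actual error distribution from normality.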
It can be seen that the increasing 
separation among curves is no longer 
present in Fig. 8. Thus the authors 
have attributed an artifact produced 
by the arbitrary selection of target 
size to their independent variable, 
and they might have reached an op- 
posite conclusion had a different tar- 
get zone been selected.


IMPLICATIONS FOR PERFORMANCE
MEASUREMENT


It should be pointed out that the 
nonlinear relation between RMS and 
time-on-target scores does not invali- 
date all use of the latter scores. For 
certain gross comparisons, intended
only to determine the presence or
absence of a significant effect, either
type of score may be adequate. In-
deed, the two types of scores would
be expected, and have been found (5, 
6, 7) to correlate rather highly. Arti- 
facts in the interpretation of results 
occur primarily when attempts are 
made to test for interaction effects 
or to interpret functional relations 
over an extended range of task diffi- 
culty or over an extended period of 
learning. 

Thus it would appear that a single 
target zone can provide a score of 
only limited value as an indicant of 
tracking performance. This is par- 
ticularly true if performance on dif- 
ferent tasks or at different stages of 
learning varies over a wide range, so 
that the percentage of time on target 
is either very low, or very high for 
some of the conditions to be evalu-
ated.

Simultaneous recording of scores
for several target widths is one
method of obtaining performance
records which are less likely to lead
to confounding of the effects pro-
duced by independent variables and
the scoring mechanics (although per-
haps this presents E with a difficult
choice of functions). Another possi-
ble approach would involve trans-
formation of time-on-target scores to
yield an estimate of the RMS score,
by constructing a plot such as that
shown in Fig. 3, or by direct reference
to a table of the normal curve. This
procedure may be of limited value,
however, because of excessive de-
mands upon the reliability of the
scoring apparatus near the extremes
of the scale.

The use of the RMS measure itself
appears to us as the best method of
avoiding the problems discussed in
this paper. The best single statistic
(in addition to the mean) for describ-
ing a normal or near-normal distribu-
tion is generally accepted to be the
SD. Furthermore, this score does not
impose an artificial ceiling upon im-
provement as do the time-on-target
scores. Other advantages lie in the
fact that the RMS value provides a
score equally useful for problems of
all difficulty levels, and that the
measure reflects the entire distribu-
tion of error amplitudes, rather than
just a dichotomized version of the
distribution. The selection of this
measure is, of course, also arbitrary
in one sense, since RMS does not
change linearly as a function of prac-
tice. The lack of true scales for the
measurement of learning, and the
consequent difficulty of comparing
variability at different stages of prac-
tice has been pointed out before (11,
p. 635), and the use of the RMS meas-
ure does not solve these problems.
The advantage of the RMS measure
simply lies in the substitution of a
single function for an unlimited num- 
ber of functions determined by all 
possible target dimensions. As a con- 
sequence, there result advantages of 
comparability of data and ease of 
interpretation. If the RMS score is 
computed with respect to zero error 
equals perfect performance, rather 
than with respect to zero equals S's 
own mean, then the score will reflect 
constant error as well as variable
error (i.e., $MS_{\text{total error}} = MS_{\text{variable error}} + MS_{\text{constant error}}$).
It would appear from the amplitude
distributions presented here that
such constant errors
are relatively minor and can usually
be disregarded. This is likely to be
true in most studies of continuous
tracking, where lead or lag of the
cursor relative to the target will each
result in positive and negative voltage
errors depending upon the momentary
direction of motion. However,
the mean plus or minus error
can usually be determined by the use
of slightly more complex scoring
equipment and the RMS score can
then be reduced to the variable error
in pure form. Thus, we can conclude
that the best single measure of tracking
performance is error RMS (or
perhaps simply mean error). A more
complete picture of performance can
be gained by recording the complete
amplitude distribution of error. 
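
The partition of total error into constant and variable components can be checked on a sample of error readings. The following is a minimal sketch; the readings are invented for illustration, and the function name is hypothetical. Constant error is taken as the mean signed error and variable error as the SD about that mean.

```python
import math
from statistics import fmean, pstdev


def error_components(errors):
    """Return (constant error, variable error, total RMS error).

    constant: mean signed error
    variable: SD of the errors about their own mean (population form)
    total:    RMS of the errors about zero, i.e. about perfect performance
    """
    constant = fmean(errors)
    variable = pstdev(errors)
    total = math.sqrt(fmean([e * e for e in errors]))
    return constant, variable, total


# Hypothetical signed error readings, arbitrary voltage units.
readings = [0.5, -1.0, 1.5, -0.5, 2.0]
constant, variable, total = error_components(readings)

# Mean squares add: total squared equals variable squared plus constant squared.
print(math.isclose(total ** 2, variable ** 2 + constant ** 2))
```

When the constant error is near zero, the total RMS and the variable error are nearly identical, which is the case in which the constant component can be disregarded.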

Although the present analysis has 
dealt only with the measurement of 
tracking performance, the conclu- 
sions have much wider implications 
for psychology. Similar problems ex- 
ist wherever response characteristics 
follow a continuous and normal dis- 
tribution and where learning results 
in diminished variance of this distri- 
bution, but performance is scored 
according to an all-or-none criterion 
of frequency of occurrence. Not only 
are performance scores in tracking 
tasks, such as that provided by the 
rotary pursuit apparatus, subject to 
artifacts arising from the arbitrary 
choice of the size of the target zone, 
but so are scores in many other tasks 
such as steadiness tests, dotting tests, 
tweezer dexterity tests, pegboard 
tests, etc., where success is scored 
against an all-or-none criterion. It 
even appears likely that the records
of many other types of behavior, including
such diverse responses as
conditioned eyelid responses, leg flexion,
and maze turning, which are recorded
in terms of frequency of occurrence,
may show similar artifacts provided
the underlying habit strength
varies as a continuous function of
practice.